ABSTRACT

The success of objectives-based programs depends to a
considerable extent on how effectively students and teachers assess
mastery of objectives and make decisions for future instruction.
While educators disagree on the usefulness of criterion-referenced
tests, the position taken in this monograph is that
criterion-referenced tests are useful, and that their usefulness will
be enhanced by developing testing methods and decision procedures
specifically designed for their use within the context of
objectives-based programs. This monograph serves as a review and an
integration of existing literature relating to the theory and
practice of criterion-referenced testing with an emphasis on
psychometric and statistical matters, and provides a foundation on 
which to design further research studies. Specifically, the material 
is organized around the following topics: Definitions of 
criterion-referenced tests and measurements, test development and 
validation, statistical issues in criterion-referenced measurement, 
selected psychometric issues, tailored testing research, description 
of a typical objectives-based program, and suggestions for further 
research. The two types of criterion-referenced tests focused on are: 
Estimation of "mastery scores" or "domain scores", and the allocation 
of individuals to "mastery states" on the objectives in a program.
(Author/BJG) 

With the need for significant changes in our elementary and secondary
schools clearly documented by Project Talent data (Flanagan, Davis,
Dailey, Shaycoft, Orr, Goldberg, & Neyman, 1964), we have seen the
development and implementation of a diverse collection of alternative
educational programs that seek to improve the quality of education
by individualizing instruction (Gibbons, 1970; Gronlund, 1974;
Heathers, 1972). A common characteristic of many of the new programs
is that the curriculum is defined in terms of instructional objectives;
a program specified in such a way is referred to as objectives-based.
The overall goal of an objectives-based instructional
program is to provide an educational program which is maximally 
adaptive to the requirements of the individual learner. The 
instructional objectives specify the curriculum and serve as a basis 
for the development of curriculum materials and achievement tests. 
Among the best examples of objectives-based programs are Individually
Prescribed Instruction (Glaser, 1968, 1970); Program for Learning in
Accordance with Needs (Flanagan, 1967, 1969); and the Individualized
Mathematics Curriculum Project (DeVault, Kriewall, Buchanan, &
Quilling, 1959).

This material is an integration of previously published articles by
the authors with several of their new contributions. In addition,
an attempt was made to place the total material in a broader context
of developments in the criterion-referenced testing field.

Unfortunately, while considerable progress has been made in
important areas such as the construction of instructional materials,
curriculum design, and computer management, until quite recently
(Glaser & Nitko, 1971; Harris, Alkin, & Popham, 1974; Millman, 1974)
there have been few reliable guidelines for test construction, test
assessment, and test score interpretation, and this in turn has hampered
effective implementation of the programs. One of the underlying premises
of objectives-based programs is that effective instruction depends,
in part, on a knowledge of what skills the student has. It
follows that the tests used to monitor student progress should be
closely matched to the instruction. Over the years, standard procedures
for testing and measurement within the context of traditional
educational programs have become well-known to educators; however,
the procedures are much less appropriate for use within objectives-based
programs (Glaser, 1963; Hambleton & Novick, 1973; Popham &
Husek, 1969).

As an alternative, we have seen the introduction of criterion-referenced
tests, which are intended to meet the testing and measurement
requirements of the new objectives-based programs. In view
of the importance of criterion-referenced testing to the success of
objectives-based programs, and their newness, it is perhaps not surprising
to note the many articles written on the topic and that these
articles typically reflect diverse points of view concerning criterion-referenced
test definitions, methods of test development,
assessment of psychometric properties, and so on. Now, with the
important integrating works of Glaser and Nitko (1971), Millman (1974),
and Harris et al. (1974), terminology has been standardized, issues
delineated, and many important technical developments identified.

Purposes 

Clearly, the success of objectives-based programs depends to a 
considerable extent upon how effectively students and teachers assess 
mastery of objectives and make decisions for future instruction. 
While not all educators agree on the usefulness of criterion-referenced
tests (Block, 1971; Ebel, 1971), the position taken in this
monograph is that criterion-referenced tests are useful, and that their 
usefulness will be enhanced by developing testing methods and deci- 
sion procedures specifically designed for their use within the con- 
text of objectives-based programs. Our monograph is intended to 
serve as a review and an integration of existing literature relating to 
the theory and practice of criterion-referenced testing with an em- 
phasis on psychometric and statistical matters, and to provide a solid 
foundation on which to design further research studies. Specifically, the 
material in the monograph is organized around the following topics: definitions
of criterion-referenced tests and measurements, test development
and validation, statistical issues in criterion-referenced measurement,
selected psychometric issues, tailored testing research, description
of a typical objectives-based program, and suggestions for further research.
Whereas there are a multitude of uses for criterion-referenced
tests, we have chosen to provide a concentrated study in this
monograph of only two: estimation of "mastery scores" or "domain
scores", and the allocation of individuals to "mastery states" on
the objectives in a program. Both criterion-referenced test uses
directly concern the day-to-day management of students through an
objectives-based program.

The monograph is intended to serve as a companion paper to the review 
by Hambleton (1974) on testing and decision-making procedures within selected
objectives-based programs, and to provide an expanded discussion of
one of the four major areas of use of criterion-referenced tests described 
in the excellent monograph by Millman (1974). Millman indi- 
cated four major areas of use (needs assessment, individualized in- 
struction, program evaluation, and teacher improvement and personnel 
evaluation) and there may be others. However, we have limited our 
discussion to the use of criterion-referenced tests within the context 
of individualized instructional programs, although the extension to 
other areas, in some cases, is obvious. Our work also serves as a 
second response to some of the technical measurement problems posed 
by Harris et al. (1974).

Definitions of Criterion-Referenced Tests and Measurements

A criterion-referenced test has been defined in a multitude of 
ways in the literature. (See, for example, Glaser & Nitko, 1971; 
Harris & Stewart, 1971; Ivens, 1970; Kriewall, 1969; and Livingston, 
1972a.) The intentionally most restrictive definition of a criterion- 
referenced test was proposed by Harris & Stewart (1971): "A pure 
criterion-referenced test is one consisting of a sample of production 
tasks drawn from a well-defined population of performances, a sample 
that may be used to estimate the proportion of performances in that 
population at which the student can succeed [p. 1]." On the other hand,
possibly the least restrictive definition is that by Ivens (1970) who 
defined a criterion-referenced test as one "comprised of items keyed 
to a set of behavioral objectives [p. 2]." Given the current state of 
the art, Ivens's definition would correspond to what we now refer to
as an "objectives-based test" (Donlon, 1974; Millman, 1974), and this
kind of test is not going to allow us to make the strongest kind of
criterion-referenced interpretation, i.e., to treat the score as an
indication of the examinee's level of mastery in some well-specified
content domain (Traub, 1972). A very useful definition has been
proposed by Glaser and Nitko (1971): "A criterion-referenced test 
is one that is deliberately constructed so as to yield measurements 
that are directly interpretable in terms of specified performance 
standards." According to Glaser and Nitko, "The performance stan- 
dards are usually specified by defining some domain of tasks that 
the student should perform. Representative samples of tasks from 
this domain are organized into a test. Measurements are taken and 
are used to make a statement about the performance of each indivi- 
dual relative to that domain [p. 653]." 

If one accepts the Glaser and Nitko definition of a criterion-referenced
test, it is apparent that the test may be constructed of
items from more than one domain. An assessment of mastery or an 
instructional decision for each individual is then made on the basis 
of the student's performance on items from each domain. Major interest 
thus rests on the reliability and validity of domain scores. (For more
on this, see Baker, 1974; Bormuth, 1970; Hively, Patterson, & Page, 1968; 
Glaser & Nitko, 1971; Millman, 1974; Popham, 1974; Skager, 1974.) 

Following the Glaser and Nitko definition, the construction of 
a criterion-referenced test requires the sampling of items from well- 
specified domains of items. The domain "may be extensive or a sin- 
gle, narrow objective, but it must be well defined, which means that 
content and format limits must be well specified" (Millman, 1974). 
The specification of the domain is crucial for putting together a
criterion-referenced test, since only then can the criterion-referenced
test scores be interpreted most directly in terms of knowledge
of performance tasks. It should be noted that the word "criterion"
does not refer to a criterion in the sense of a normative standard
but rather to the minimal acceptable level of functioning that an
examinee must achieve in order to be assigned to a mastery state on
each domain included in the test. Therefore, the term "domain-referenced
test" may be less ambiguous than the term "criterion-referenced
test." Furthermore, the term "criterion-referenced" may imply that
the only use for the test is to make mastery decisions. Estimation 
of domain scores is another important use. 

Distinctions Among Testing Instruments and Measurements

With the availability of a test theory for norm-referenced 
measurements (e.g., see Lord & Novick, 1968), we have procedures 
for constructing appropriate measuring instruments, i.e., norm- 
referenced tests. Do objectives-based programs which require 
different kinds of measurement (i.e., criterion-referenced measurement)
also require new kinds of tests, or will the usual norm-referenced
tests with alternate procedures for interpreting test
scores be appropriate? There is little doubt that different tests
are needed, constructed to meet quite different specifications than
those typically set for norm-referenced tests (Glaser, 1963). However,
it should be noted that a norm-referenced test can be used
for criterion-referenced measurement, albeit with some difficulty,
since the selection of items is such that many objectives will very
likely not be covered on the test or, at best, will be covered with
only a few items. It has been noted by at least two writers (Millman,
1974; Traub, 1972) that when items in a norm-referenced test can be
matched to objectives, criterion-referenced interpretations of the
scores are possible, although they are quite limited in generalizability.
A criterion-referenced test constructed by procedures especially
designed to facilitate criterion-referenced measurement can be,
and sometimes is, used to make norm-referenced measurements. However,
a criterion-referenced test is not constructed specifically to maximize
the variability of test scores (whereas a norm-referenced test
is). Thus, since the distribution of scores on a criterion-referenced
test will tend to be more homogeneous, it is obvious that such
a test will be less useful for ordering individuals on the measured
ability. In summary, a norm-referenced test can be used to make
criterion-referenced measurements, and a criterion-referenced test
can be used to make norm-referenced measurements, but neither usage
will be particularly satisfactory.
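
The two readings of a single raw score can be contrasted in a minimal sketch. The function names, the ten-item test, and the group scores below are hypothetical and serve only as an illustration; they are not drawn from the literature reviewed here.

```python
def percentile_rank(score, group_scores):
    """Norm-referenced reading: percent of the comparison group scoring below `score`."""
    if not group_scores:
        raise ValueError("group_scores must not be empty")
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


def proportion_correct(score, n_items):
    """Criterion-referenced reading: proportion of the sampled items answered correctly."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return score / n_items


# Hypothetical raw scores of a comparison group on a 10-item test.
group = [6, 7, 7, 8, 8, 8, 9, 9, 10, 10]
student_score = 8

# The same raw score, interpreted relative to the group and relative to the domain.
print(percentile_rank(student_score, group))
print(proportion_correct(student_score, 10))
```

When the group is homogeneous, as is likely on a criterion-referenced test, small differences in raw score move the percentile rank a great deal, while the proportion correct remains directly interpretable.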

It has been argued that to refer to tests either as norm-referenced
or criterion-referenced may be misleading since measurements
obtained from either testing instrument can be given a norm-referenced
interpretation, criterion-referenced interpretation, or both.
The important distinction made was that between norm-referenced
measurement and criterion-referenced measurement (Glaser, 1963;
Hambleton & Novick, 1973). From a historical perspective, this distinction
was important since a methodology for constructing criterion-referenced
tests did not exist, at least at the time of Glaser's
article. Criterion-referenced tests were constructed in the same
manner as norm-referenced tests, and as pointed out above, the usage
was not satisfactory. However, in view of the recent developments in
the field, it may not be misleading to label tests as either criterion-referenced
or norm-referenced. In fact, given the operational
definitions, the distinction between criterion-referenced tests and
norm-referenced tests may not only be unambiguous but also meaningful.

Further distinctions between norm-referenced and criterion-referenced
tests and measurements have been presented by Block (1971), Carver
(1974), Ebel (1962, 1971), Glaser and Nitko (1971), Harris (1974a),
Hieronymus (1972), Messick (1974), and Popham and Husek (1969).

Estimation of Domain Scores and Allocation
of Individuals to Mastery States

Assume that a criterion-referenced test is constructed by ran- 
domly sampling items from a well-defined domain of items. There are 
two basic uses for which the scores obtained from the criterion-refer- 
enced test are ideally suited. 

Supposing that a student has a true score π, defined, say, as
the proportion of items in the domain of items that a student can
correctly answer, the problem is to obtain an estimate of his score
π based on his performance on a random sample of items from the domain.
(The true score need not be defined as the proportion of
correct items. Other definitions may be suitable.) Millman (1974)
has aptly termed this the "estimation of domain scores." (Other
terms for domain score are "level of functioning score" and "true
mastery score.") There are several approaches for the estimation
of π, and we shall return to a discussion of these estimates in a
later section.
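
As a minimal sketch of this first use, and assuming the proportion-correct definition of π given above, the simplest estimate is the observed proportion of sampled items answered correctly. The helper names and the 50-item domain are hypothetical; other estimates are possible, and this one is shown only as an illustration.

```python
import random


def sample_test(domain_items, n_items, seed=None):
    """Draw a random sample of items (without replacement) from a well-defined domain."""
    rng = random.Random(seed)
    return rng.sample(domain_items, n_items)


def estimate_domain_score(responses):
    """Proportion-correct estimate of the domain score.

    `responses` is a list of 0/1 item scores (1 = correct) on the sampled items.
    """
    if not responses:
        raise ValueError("at least one item response is required")
    return sum(responses) / len(responses)


# Hypothetical domain of 50 item identifiers, from which an 8-item test is drawn.
domain = list(range(1, 51))
test_items = sample_test(domain, 8, seed=1)

# Hypothetical scored responses of one student to the 8 sampled items.
responses = [1, 1, 0, 1, 0, 1, 1, 1]
print(estimate_domain_score(responses))
```

With only a few items per objective, an estimate of this kind will be quite variable from one sample of items to the next.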

The other use of the scores derived from a criterion-referenced
test is consistent with the notion that testing is a decision process
(Cronbach & Glaser, 1965). It makes sense to assume that each
examinee has a true mastery state on each objective covered in the
criterion-referenced test. Typically, a cut-off score or threshold
score is set to permit the decision-maker to assign examinees, on
the basis of their performance on each subset of items measuring an
objective covered in the criterion-referenced test, into one of two
mutually exclusive categories: masters and non-masters. Here, the
examiner's problem is to place each examinee in the correct mastery
category. For the purposes of this discussion, let us assume
that there are just two mastery states: masters and non-masters.
(In a later section, we will extend the discussion to include the
problem of assigning an examinee into one of k mastery states.)
There are two kinds of errors that occur in this classification prob- 
lem: False-positives and false-negatives. A false-positive error 
occurs when the examiner estimates an examinee's ability to be above 
the cutting score when, in fact, it is not. A false-negative error
occurs when the examiner estimates an examinee's ability to be below
the cutting score when the reverse is true. The seriousness of making
a false-positive error depends to some extent on the structure of the
instructional objectives. It would seem that this kind of error has 
the most serious effect on program efficiency when the instructional 
objectives are hierarchical in nature. On the other hand, the ser- 
iousness of making a false-negative error would seem to depend on the 
length of time a student would be assigned to a remedial program be- 
cause of his low test performance. The minimization of expected loss 
would then depend, in the usual way, on the specified losses and the 
probabilities of incorrect classification. This is then a straight- 
forward exercise in the minimization of what we would call threshold 
loss. Complete details for assigning examinees to mastery states are 
described in a later section. 
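
The decision rule just described can be sketched as follows. The sketch assumes that a probability of true mastery for the examinee (`p_master`) is already available from some estimation procedure, and that correct decisions incur no loss; the function names and the loss values are hypothetical.

```python
def expected_losses(p_master, loss_false_positive, loss_false_negative):
    """Expected threshold loss of each action for one examinee on one objective.

    p_master: assumed probability that the examinee's true score is at or
    above the cutting score. Correct decisions are taken to have zero loss.
    """
    if not 0.0 <= p_master <= 1.0:
        raise ValueError("p_master must lie between 0 and 1")
    return {
        # Advancing is wrong only if the examinee is in fact a non-master.
        "advance": loss_false_positive * (1.0 - p_master),
        # Retaining (remediation) is wrong only if the examinee is in fact a master.
        "retain": loss_false_negative * p_master,
    }


def choose_action(p_master, loss_false_positive, loss_false_negative):
    """Return the action with the smaller expected loss ('advance' on ties)."""
    losses = expected_losses(p_master, loss_false_positive, loss_false_negative)
    return min(losses, key=losses.get)


# Hypothetical losses: a false positive is judged twice as serious as a
# false negative, as might be the case with hierarchical objectives.
print(choose_action(0.7, loss_false_positive=2.0, loss_false_negative=1.0))
print(choose_action(0.6, loss_false_positive=2.0, loss_false_negative=1.0))
```

Raising the false-positive loss relative to the false-negative loss raises the probability of mastery required before an examinee is advanced.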

Test Development and Validation

Introduction 

In this section of the monograph, we put forth procedures for 
constructing valid domain-referenced tests. Such tests are used for 
much different purposes than norm-referenced tests and, consequently, 
the procedures needed to develop and validate domain-referenced tests 
will also be different. 

In view of the purposes of domain-referenced tests presented
in this monograph, content validity becomes the center of validation
concerns. While it is appropriate to study the other validities
of a domain-referenced test, it is essential that the content validity
be carefully established in order that the test yield meaningful
scores. Indeed, some aspects of the construction process also serve to
establish the content validity of the test. The symbiotic relationship that exists
between domain-referenced test construction procedures and content
validity is illustrated by Jackson's (1970) remarks:

". . . the term criterion-referenced [here, domain-referenced]
will be used here to apply only to a test designed
and constructed in a manner that defines explicit rules
linking patterns of test performance to behavioral referents.
. . . The meaningfulness and reproducibility of test
scores derives then from the complete specification of the
operations used to measure the quantity involved." (p. 3)

Jackson's statement implies that a properly constructed domain-referenced
test will result in a meaningful score. Thus, the question
of validity, specifically content validity, of a domain-referenced
test can only be answered within the context of proper construction
procedures. More specifically, the problem that is unique to domain-referenced
tests is that of linking the test item to the behavioral
referent, and this is a content validation problem. Osburn (1968) stresses
the importance of this aspect of domain-referenced testing in
the following remark:

"What the test is measuring is operationally defined by 
the universe of content as embodied in the item genera- 
ting rules. No recourse to response- inferred concepts 
such as construct validity, predictive validity, under- 
lying factor structure or latent variables is necessary 
to answer this vital question". 

While we agree in part with Osburn's position, we do not com- 
pletely reject the usefulness of such response-inferred concepts as 
predictive (or criterion) validity. These concepts will be discussed 
later in the monograph. 

At this point the reader should be reminded of the important 
differences between norm-referenced tests and domain-referenced tests.
In general, the purpose of a norm-referenced test is to discriminate
among individuals on some ability continuum. In order to achieve
this purpose there needs to be some variability in the scores. It
is clear that without variability among the scores no discrimina- 
tions can be made. 

On the other hand, in general, a domain-referenced test may be
used to determine an individual's level of functioning or it may be
used to make an instructional decision involving the student. Other
test uses exist, such as evaluating instruction (Millman, 1974); however,
these uses will not be considered in this monograph. The essential
aspects of the domain-referenced test in terms of these two uses
are that the test items reflect the criterion and that the items
were sampled in an appropriate manner from the population of domain
items. Variability is not a factor; all the individuals taking the
test could be at a very high level of functioning, thus getting most
or all the items correct and thereby significantly reducing the
variability of scores. However, variability in domain-referenced
testing is not a completely useless concept. Indeed, variability
will be observed when the sample of examinees is heterogeneous
in terms of their ability to answer items from a given content domain.
If the composition of the examinee sample is established a priori,
the resulting variability will provide additional, helpful information
for constructing a good domain-referenced test.

It should also be noted here that the different uses for domain-referenced
tests do not have differential implications for the construction
of the tests. Basically the same construction and content
validation procedures are followed regardless of the intended use of 
the score. However, the intended use of the test will influence the 
number of items to be selected. This point will be discussed later. 

Domain-Referenced Test Construction Steps 

Introduction. There are six basic steps in constructing
domain-referenced tests: 1. task analysis, 2. definition of the
content domain, 3. generation of domain-referenced items, 4. item
analysis, 5. item selection, and 6. test reliability and validity. These
steps are in close agreement with the steps outlined by Fremer (1974).
The remainder of this section will examine in detail each of the do- 
main-referenced test construction steps. These steps will be con- 
trasted, when appropriate, to the analogous norm-referenced test con- 
struction step. 

Task Analysis. A task analysis separates into manageable components
the complex behaviors that are to be tested. Task analysis actually
precedes the test construction process. In domain-referenced
testing, a task analysis provides a logical basis upon which the content
domain definitions may be developed. It puts into perspective
the purpose of the test and the characteristics of the examinees.

A simple example of a domain-referenced test task analysis might
be a general behavioral objective statement. While behavioral objectives
do not provide sufficient detail for writing items, they can
serve to delineate the general scope of the content domain. Once
the task analysis is completed, the domain-referenced test development
steps are a focussing and detailing process.

Definition of the Content Domain. The focussing and detailing
process referred to above is essentially defining the content domain. 
This particular step is the most difficult one as well as the most 
critical step in constructing a good domain-referenced test. Many 
approaches to defining a content domain have been suggested in the 
literature (Osburn, 1968; Hively et al., 1973; Bormuth, 1970; Guttman
and Schlesinger, 1966; Popham, 1974). 

Recall that a central factor of a domain-referenced test is that
its items are linked to the content domain in such a way that responses
to the items yield information about mastery of that domain. However,
this essential fact is the source of a significant difficulty.
Put simply, the difficulty is in establishing a content domain that 
on the one hand permits explicit items to be written from it and on 
the other hand is not itself trivial (Ebel, 1971). Establishing a 
domain is a content specification problem and is closely linked to 
problems in the discussion that follows. 

Our position is to seek a balance between those procedures that
specify content via item generation rules (Bormuth, 1970; Hively
et al., 1973) and other procedures that begin with behavioral objectives
too general to yield domain-referenced items. The reason for
this position is that, first, content delineation that is item speci- 
fic is too restrictive to be educationally useful, and second, a mean- 
ingful domain-referenced interpretation of the scores is not possible 
with generally stated objectives. 
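
To make the contrast concrete, an item generation rule might look something like the following sketch. The rule (addition of two two-digit numbers with a carry from the ones column) and all names are hypothetical; they are not taken from Bormuth (1970) or Hively et al. (1973), and are offered only to illustrate the idea of content that is specified at the level of the item.

```python
import random


def generate_item(rng):
    """Generate one item from a hypothetical rule: a + b, where a and b are
    two-digit numbers whose ones digits sum to 10 or more (a carry is required)."""
    while True:
        a = rng.randint(10, 99)
        b = rng.randint(10, 99)
        if (a % 10) + (b % 10) >= 10:
            return {"stem": f"{a} + {b} = ?", "key": a + b}


def generate_items(n, seed=0):
    """Generate n items from the rule with a reproducible random stream."""
    rng = random.Random(seed)
    return [generate_item(rng) for _ in range(n)]


for item in generate_items(3):
    print(item["stem"], item["key"])
```

Every item, and its correct answer, follows mechanically from the rule. This gives the clarity noted above, but the sketch also suggests how narrow a domain must be before it can be specified in this way.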

Specifically, we believe that Popham's (1974) notion of an ampli- 
fied objective provides an excellent balance between the clarity 
achieved with item generation schemes and the practicality of behav- 
ioral objectives. Thus, amplified objectives represent a compromise 
position in the clarity-practicality dilemma and as such, they are 
likely to represent the approach adopted by individuals interested 
in developing domain-referenced tests. The compromise seems essential 
since it does not appear likely that the notion of specifying content 
via the use of item generation rules will be applicable to many subject 
areas. Certainly to date little progress has been made along these 
lines although, as Millman (1974) notes, "The task is very difficult, but
we have just not had enough experience constructing tests, such as DRT's,
to know [the limitations of the approach]."

According to Millman (1974), "An amplified objective is 
an expanded statement of an educational goal which provides boundary 
specifications regarding testing situations, response alternatives 
and criteria of correctness." The amplified objective defines the
content to be dealt with, the response format, and criteria of correctness.
The important aspect of these guidelines is that they are
specific; it is not necessary, however, that they specify a homogeneous
content area. Specificity and homogeneity are different
concepts. Millman (1974) makes this point: "The domain being referenced
by a criterion-referenced test may be extensive or a single,
narrow objective, but it must be well defined, which means that content
and format limits must be well specified."

An example of an amplified objective taken from Popham (1974)
is:

"When presented with a series of the following types of
statements concerning U.S.-Cuba relationships, the
learner will correctly identify those which are true:

a. Economic: dealing with size of mutual imports of
tobacco, rice, sugar, wheat for the period 1925-1955.

b. Political: dealing with status of formal diplomatic
relationships from 1925 to the present.

c. Military: dealing with the post-Castro period emphasizing
the Bay of Pigs incident and the USSR missile
crises."

Popham says that we may further "amplify" this objective by speci- 
fying the kinds of true or false items to be used. Further, it 
should be noted that even by limiting the set of meaningful test
items using amplified objectives there still exists the danger of
developing a trivial set of items (Popham, 1974). 

Before examining the next step in domain-referenced test construction
it would be worthwhile to note that the content domain
defined for a norm-referenced test (that is, a test constructed to
facilitate norm-referenced interpretations) would seldom be as explicitly
defined. However, it would be quite incorrect to state,
as some writers have, that the content domain of items for a norm-referenced
test is not well-defined. In many cases, it is very
well-defined, but not to the same extent as is necessary for the
construction of domain-referenced tests.

Generation of Domain-Referenced Items. Once the domain is defined,
the test constructor must generate test items. If the domain
were defined in a perfectly precise manner, then the items themselves
would not need to be generated. The items would simply be a logical
consequence of the domain definition. Unfortunately, however, such 
precision may never be achieved in practice and we must, therefore, 
generate items and then develop procedures to check the quality of 
these items. Examining the quality of the items falls under the 
next section, item analysis. 

Even without a perfectly precise specification of the content
domain the test constructor should have an excellent idea of item
content and format from the statement of the amplified objective.
At this stage of the test construction process the item writer would
study the amplified objective and generate a set of items that were
believed to reflect the domain specified by the amplified objective.
After generating a set of domain-referenced test items in this manner,
it is necessary to determine the quality of the items through item
analysis procedures described below.

Item Analysis. Generally speaking, the quality of domain-referenced
items is determined by the extent to which they reflect, in
terms of their content, the domain from which they were derived.
Because the domain specification is never completely precise, we
must determine the quality of the items in a context independent
from the process by which the items were generated. Specifically,
what is needed are procedures that will determine the extent to
which the items reflect the content domain.

There are two general approaches that may be used to establish
the content validity of domain-referenced test items. The first
approach involves having content specialists judge each item. The
judgements that are made concern the extent of the "match" between
the test items and the domains they are designed to measure.

The second item analysis procedure is to apply suggested empirical
techniques that have been frequently used in norm-referenced
test construction along with some new empirical procedures that have
been developed exclusively for use within criterion-referenced test
development projects. However, it is important to state that we do
not advocate the use of empirical methods to select items that would
comprise a particular domain-referenced test. We take this position
for two reasons. First, selecting items for a domain-referenced test
on the basis of their statistical properties would destroy the requirement
that the items are representative of the domain of items. Hence,
the proper interpretation of domain-referenced test scores would not
be possible. Second, empirical methods provide useful information
for detecting "bad" items, but the information, by itself, is not sufficient
to establish the validity of the domain-referenced test items.
Here we highlight some of the important aspects of these two ap- 
proaches; a more detailed discussion may be found in Coulson and 
Hambleton (1974) and Rovinelli and Hambleton (1973). 

(a) Content Specialist Ratings. Probably the most common approach
to item validation, although it is fraught with problems, involves the
judgements of two content specialists. One suggested procedure is as follows:
We first choose two independent and qualified content specialists to
judge the quality of the items. Concurrently the test developer has
drawn up a set of items to measure each of several amplified
objectives. The rating data are gathered in the following way. A sheet
is prepared with a brief paragraph on the top that describes the
objective. Below the description of the instructional objective a
single question would appear. For example:

     Below are 10 test items that are believed to measure
     the instructional objective described above. Please rate
     each item on a scale from 1 to 4 according to the question
     below.

     "How appropriate or relevant is the item for the
     instructional objective described above?"

     1. Not at all relevant
     2. Somewhat relevant
     3. Quite relevant
     4. Extremely relevant

The data collected from the two content specialists are arranged
into a contingency table with general element $p_{ij}$ equal to the
proportion of items that were classified in category i (1, 2, 3, or 4 above)
by the first specialist and category j by the second.

An intuitively appealing measure of agreement between the
classification of items made by the content specialists is

     $$ \sum_{i=1}^{k} p_{ii} $$

where $p_{ii}$ is the proportion of items placed in the ith category by
each content specialist and k (= 4) is the number of categories.
However, this measure of agreement does not take into account the
agreement that could be expected by chance alone, and hence does not seem
entirely appropriate. The coefficient kappa introduced by Cohen
(1960) takes into account this chance agreement and thus appears to
be somewhat more appropriate.
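
As an illustration only, the two agreement measures might be computed from
a table of rating counts as in the sketch below. The function name and the
sample counts are hypothetical and are not taken from any study cited here;
kappa is computed in its usual form, (observed agreement - chance
agreement) / (1 - chance agreement), with chance agreement obtained from
the products of the two specialists' marginal proportions.

```python
def agreement_and_kappa(table):
    """Return (observed agreement, Cohen's kappa) for a square table.

    table[i][j] is the number of items placed in category i by the first
    specialist and in category j by the second (hypothetical input layout).
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    # Convert counts to proportions p_ij.
    p = [[cell / total for cell in row] for row in table]
    # Observed agreement: sum of the diagonal proportions p_ii.
    p_observed = sum(p[i][i] for i in range(k))
    # Marginal proportions for each specialist.
    row_marg = [sum(p[i]) for i in range(k)]
    col_marg = [sum(p[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance alone.
    p_chance = sum(row_marg[i] * col_marg[i] for i in range(k))
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return p_observed, kappa


# Hypothetical ratings of 10 items on the 4-point relevance scale.
ratings = [
    [1, 0, 0, 0],
    [0, 2, 1, 0],
    [0, 1, 3, 0],
    [0, 0, 0, 2],
]
p_obs, kappa = agreement_and_kappa(ratings)
```

With these made-up counts the observed agreement is .80, while kappa is
noticeably lower because part of that agreement would be expected by chance.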

One disadvantage to the approach discussed above is that it
cannot be used to provide explicit statistical information on the
agreement of judgements for each item. With the availability of
more content specialists (i.e., perhaps 10 or more), such
information could be obtained. Indeed there exist a multitude of rating
forms and statistics to assess the level of agreement among content
specialists on the match between items and objectives [for example,
see Goodman and Kruskal (1954); Light (1973); Lu (1971); Maxwell and
Pilliner (1968)]. Applications of these statistics to problems of
item validation have been described by Coulson and Hambleton (1974).

(b) Empirical Methods. Empirical methods, such as using
discrimination indices (Cox & Vargas, 1966; Crehan, 1974; Wedman, 1973),
may provide useful information for detecting "bad" items. Indeed
Wedman (1973) gives a compelling argument for using empirical
procedures. He argues that even careful domain definition and precise
item generation specifications never completely eliminate the
subjective judgments that, to greater and lesser degrees, influence the test
construction process. In order to guard against this subjective
element, albeit small, we should complement the domain definition and
item generating procedures with empirical evidence on the items.

Essentially, empirical procedures involve the use of various
item statistics that measure item difficulty and item discrimination.
In all instances, for these statistics to be meaningful, it is
necessary to have some item variability across examinees.

There has been some discussion recently on the matter of item
and test variance with criterion-referenced tests (Haladyna, 1974;
Millman & Popham, 1974; Woodson, 1974). Our own view, which is in
agreement with Millman and Popham (1974), is that item and test
variance is unnecessary with a domain-referenced test. The "quality"
of the test is determined by the extent of the match between the
items in the test and the domain they are intended to measure, and
of course whether or not the items represent a random sample of
items from the domain of items. From this point of view, item and
test variance play no role in the determination of the validity of
the test for estimating domain scores. On the other hand, one would
expect some variability of scores across a pool of examinees consisting
of "masters" and "non-masters," and to the extent that there was no
(or limited) variability we might suspect that something was wrong
with the test. The test ought to reflect some variability of scores
across "masters" and "non-masters" groups, although one would not select
items to maximize this difference since this would distort the process
of estimating domain scores.

(b1) Standard Item Indices. There are a number of standard
statistical indices which appear to provide information which can be
used to ascertain whether the items are measures of the instructional
objectives. When items in a domain are expected to be relatively
homogeneous, and there are many times when this is not a reasonable
assumption (Macready & Merwin, 1973), it has become a fairly common
practice for the test developer to compare estimates of item difficulty
parameters, or item discrimination parameters, or both. Since one
would expect items measuring an objective equally well to have
similar item parameters, estimates of the parameters are compared to
detect items that deviate from the norm. Such "deviant" items are given
careful scrutiny. In particular, content specialists' judgements of the
item are considered along with the empirical evidence. If the items look
acceptable, they are returned to the item domain. A more formal method
of comparing item difficulty parameters is considered next.

Brennan and Stolurow (1971) present a set of rules for identifying
criterion-referenced test items which are in need of revision. The
decision process which they established for deciding which items to
revise can be used to determine item validity. However, our particular
interest is with their procedure for comparing difficulty levels of items
intended to measure the same objective. Brennan and Stolurow (1971)
state that the item scores from criterion-referenced tests will most
likely not be normally distributed. Therefore, in order to determine
if the item difficulties are equal, they propose the use of Cochran's
Q test. This statistic can be used to determine whether two or more
item difficulties differ significantly among themselves. Cochran's
Q is a test of the hypothesis of equal correlated proportions. For
a large enough sample of examinees, Q is approximately distributed as
a $\chi^2$ variable with n-1 degrees of freedom where n is the number of
test items. Rejection of the null hypothesis, however, provides no
guidance as to which items are significantly different. This can be
achieved by setting up confidence bands for each pair of items.
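
The computation of Cochran's Q can be sketched as follows. This is a
minimal illustration using the standard textbook formula, not a procedure
reproduced from Brennan and Stolurow (1971); the function name and the
small data set are hypothetical. With n items, item totals $C_j$, examinee
totals $R_i$, and grand total T, the statistic is
$Q = (n-1)[n \sum C_j^2 - T^2] / [nT - \sum R_i^2]$.

```python
def cochran_q(scores):
    """Return (Q, degrees of freedom) for a 0/1 examinee-by-item matrix.

    scores[e][j] is 1 if examinee e answered item j correctly, else 0.
    """
    n_items = len(scores[0])
    # C_j: number of examinees answering item j correctly.
    item_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    # R_i: number of items answered correctly by examinee i.
    examinee_totals = [sum(row) for row in scores]
    grand_total = sum(examinee_totals)
    numerator = (n_items - 1) * (
        n_items * sum(c * c for c in item_totals) - grand_total ** 2
    )
    denominator = n_items * grand_total - sum(r * r for r in examinee_totals)
    if denominator == 0:
        # Every examinee answered all items or no items correctly.
        raise ValueError("Q is undefined: no within-examinee variability")
    return numerator / denominator, n_items - 1


# Hypothetical responses of 4 examinees to 3 items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
q, df = cochran_q(responses)
```

The resulting Q would be referred to a chi-square table with n-1 degrees
of freedom; a sample this small is shown only to make the arithmetic easy
to follow, since the chi-square approximation requires many examinees.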

(b2) Item Change Statistic. The difference between the difficulty level
of an item before and after instruction describes another item statistic
that seems to have some usefulness in the validation of domain-referenced
test items. However, an important point to note is that a large
difference between the pretest and posttest item difficulty is not necessary
since items may be valid but, because of poor instruction, there may be
very little change in difficulty level between the two test
administrations. But an analysis of the change in item difficulty is an
indication of the validity of the test items. Assuming instruction is
effective, one would expect to see a substantial change in item
difficulty, if the item is a measure of the intended objective. With
several items intended to measure the same objective, one could also
compare the item change indices for the purpose of detecting items
that seem to be operating differently than the others.

Popham (1971) has proposed a two-pronged approach for developing
adequate domain-referenced test items: an a priori and an a posteriori
approach. The a priori approach corresponds to the determination of
validity by operationally generating items from an amplified
objective. The a posteriori approach consists of empirically determining
whether or not items are defective. In his discussion of the a posteriori
approach, Popham presented a new means for empirically evaluating
criterion-referenced test items. This procedure represents an extension
of the item change statistic and consists of constructing the following
fourfold table from the results of a pre-posttest administration of a
set of items measuring an objective:

                                   Posttest
                             Incorrect    Correct

     Pretest    Incorrect        A           B
                Correct          C           D

A, B, C, and D represent the percentage of examinees obtaining each of
the four possible response patterns for an item on the two test
administrations. One then computes the median value across items designed
to measure the same objective for each of the four cells. These values are
used as expected values and a chi-square statistic is computed for each
item by comparing the observed percentages in the four-fold table with the
expected values.

This chi-square analysis is used to determine the extent to which
the items are homogeneous. Popham states that this procedure was more
accurate than visual scanning in locating the atypical items. While Popham
(1971) describes other descriptive statistics for use in item analysis,
the chi-square analysis for detecting "bad" items seems to be the most
promising of his suggestions.
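
A rough sketch of this kind of analysis is given below. The exact form of
the statistic is not spelled out in the present description, so the sketch
assumes the familiar sum of (observed - expected)^2 / expected over the
four cells, with the cell medians as expected values; cells whose median is
zero are skipped, which is a further assumption. The function name and the
percentages are hypothetical.

```python
from statistics import median


def median_based_chi_square(tables):
    """Return (expected cell values, chi-square-type index per item).

    tables is a list of (A, B, C, D) percentages, one tuple per item, laid
    out as in the fourfold pretest-posttest table.
    """
    # Expected value for each cell: the median across items.
    expected = [median(t[cell] for t in tables) for cell in range(4)]
    indices = []
    for observed in tables:
        # Assumed form of the statistic; cells with a zero median are skipped.
        chi = sum(
            (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
        )
        indices.append(chi)
    return expected, indices


# Hypothetical percentages for three items measuring one objective.
item_tables = [(10, 60, 5, 25), (12, 58, 5, 25), (40, 20, 15, 25)]
expected, indices = median_based_chi_square(item_tables)
```

In this made-up example the third item, which shows far less pretest to
posttest gain than the other two, receives by far the largest index and
would be flagged for review by content specialists.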

Item Selection. The next step in the test construction process is
to select a sample of items from the population of "valid" items
defining the domain.

A prior question to the selection of test items is the determination of
test length. Since this issue is discussed in some detail in a later
section, it suffices to say here that test length is specified to achieve
some desired level of "accuracy" of test usage. The particular method of
assessing accuracy is of course dependent on the intended use of the test
scores - estimating domain scores or allocating examinees to mastery
states. (For example, see Fhaner, 1974, for an interesting solution to
the latter problem, or Kriewall, 1969, 1972.)

Item selection is essentially a straightforward process and involves
the random selection of items from the domain of valid test items that
measure the objective. In the case of a complex domain, the test developer
may resort to selecting items on the basis of a stratified random sampling
plan to achieve a "better" selection of items. It is precisely this
feature of random selection of items from a well-specified domain of items
that makes possible "strong" criterion-referenced interpretations
of the test scores (Millman, 1974; Traub, 1972). Clearly, it is exactly
this kind of interpretation that so many educators desire to make. Failure
either to specify completely the domain of items measuring an objective
or to select items in a random fashion from that domain will militate
against an appropriate criterion-referenced interpretation of an
examinee's test performance.
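
The two sampling plans can be illustrated with a short sketch. Everything
named here (the functions, the item labels, the strata, and the allocation
of items to strata) is hypothetical and serves only to show simple random
and stratified random selection from a pool of valid items.

```python
import random


def select_items(domain, test_length, seed=None):
    """Simple random sample, without replacement, of items from the domain."""
    rng = random.Random(seed)
    return rng.sample(domain, test_length)


def select_items_stratified(strata, allocation, seed=None):
    """Stratified random sample.

    strata maps a stratum name to its list of valid items; allocation maps
    the same names to the number of items to draw from each stratum.
    """
    rng = random.Random(seed)
    selected = []
    for name, items in strata.items():
        selected.extend(rng.sample(items, allocation[name]))
    return selected


# Hypothetical domain of 50 valid items, from which a 10-item test is drawn.
domain = ["item%02d" % i for i in range(50)]
test_form = select_items(domain, 10, seed=1)

# Hypothetical complex domain split into two content strata.
strata = {"computation": domain[:30], "word_problems": domain[30:]}
stratified_form = select_items_stratified(
    strata, {"computation": 6, "word_problems": 4}, seed=1
)
```

Fixing the seed merely makes the selection reproducible; it does not alter
the random character of the sampling plan.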

Test Reliability and Validity. The problem of establishing
domain-referenced test reliability will be considered in a later
section of the monograph.

If procedures described earlier are followed closely, content
validity should be guaranteed. Nevertheless, it would be desirable
to check the content validity and this can be done using a technique
described by Cronbach (1971).

The Cronbach method involves two independent test constructors 
(or teams of test constructors) developing a domain-referenced test 
from the same domain specifications. The two resulting tests are 
then administered to the same group of examinees and a correlation 
coefficient is computed between the two sets of domain-referenced test 
scores. The correlation coefficient provides a statistical indica- 
tion of the content validity of the test. 

The main disadvantage of this procedure is that it requires that
two domain-referenced tests be constructed. If the two tests were
constructed along the guidelines suggested here, the correlation study
would be rather expensive to conduct.

When the criterion-referenced tests are being used to make
instructional decisions, studies should also be designed to
investigate their predictive validities. (For more on this, see Brennan,
1974; Millman, 1974.)

Statistical Issues in Criterion-Referenced Measurement

Estimation of Examinee Domain Scores

There are several methods available for the estimation of a
domain score for an individual. The basic problem is, given an
examinee's observed score on a criterion-referenced test, to
determine his score had he been administered all the items in the domain
of items.

(a) Proportion-Correct Estimate 

The simplest and the most obvious estimate of the ith examinee's
true mastery score, $\pi_i$, defined as the proportion of items in the
domain of items measuring the objective that the examinee can answer
correctly, is his observed proportion score, $\hat{\pi}_i$. This estimate is
obtained by dividing the examinee's test score, $x_i$ (the number of
items answered correctly), by the total number, n, of the items
measuring the objective included in the test. Appealing as it may
seem in view of the fact that the proportion-correct score is an
unbiased estimate of the true mastery or domain score, this estimate
is extremely unreliable when the number of items on which the
estimate is based is small. For this reason, procedures that take into
account other available information in order to produce improved
estimates, especially in the case when there are only few items in
the test, would be more desirable.
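
The instability of the proportion-correct estimate for short tests can be
seen from its binomial standard error, $\sqrt{\pi(1-\pi)/n}$. The sketch
below is an illustration only; the function name is hypothetical, and the
observed proportion is substituted for the unknown $\pi$, which is itself
an approximation.

```python
import math


def proportion_correct(x, n):
    """Return (proportion-correct estimate, estimated standard error).

    x is the number of items answered correctly and n the number of items.
    The standard error uses the observed proportion in place of the
    unknown domain score (a plug-in approximation).
    """
    p = x / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p, standard_error


# Same observed proportion (.80) on a 5-item and on a 20-item test.
short_test = proportion_correct(4, 5)
long_test = proportion_correct(16, 20)
```

For the same observed proportion, the standard error on the 5-item test is
twice that on the 20-item test, since it shrinks with the square root of n.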

(b) Classical Model II Estimate 

One of the first attempts to produce an estimate of the true
score of an examinee using the information obtained from the group
to which an individual belongs was made by Kelley in 1927. This is
the well-known regression estimate of true score (Lord and Novick,
1968, p. 65), which is the weighted sum of two components - one
based on the examinee's observed score and the other based on the
mean of the group to which he belongs. Jackson (1972) modified this
procedure for use with binary data, by transforming the test score
$x_i$ into $g_i$ via the arcsine transformation, known as the Freeman-Tukey
transformation, given by

     $$ g_i = \frac{1}{2}\left[\sin^{-1}\sqrt{\frac{x_i}{n+1}} + \sin^{-1}\sqrt{\frac{x_i+1}{n+1}}\right]. \tag{1} $$

As a result of this transformation, the true mastery score $\pi_i$ is
transformed onto $\gamma_i$, where

     $$ \gamma_i = \sin^{-1}\sqrt{\pi_i}. \tag{2} $$

If $.15 \le \pi_i \le .85$, and if n, the number of test items, is at least
eight, then the distribution of $g_i$ is approximately normal with a
mean approximately equal to the transformed true mastery score, $\gamma_i$,
and known variance

     $$ v = (4n + 2)^{-1}. $$

The model II estimate, or the Jackson estimate, becomes, in terms of $\gamma$,

     $$ \hat{\gamma}_i = \frac{\hat{\phi}\, g_i + (4n+2)^{-1} g_{\cdot}}{\hat{\phi} + (4n+2)^{-1}}, \tag{3} $$

where $g_{\cdot}$, the sample mean based on a sample of N examinees, is given by

     $$ g_{\cdot} = N^{-1} \sum_{i=1}^{N} g_i, \tag{4} $$

and $\hat{\phi}$, the sample estimate of the variance of the $\gamma$'s, is given by

     $$ \hat{\phi} = (N-1)^{-1} \sum_{i=1}^{N} (g_i - g_{\cdot})^2 - (4n+2)^{-1}. \tag{5} $$

Once $\hat{\gamma}_i$ is obtained, $\hat{\pi}_i$ is determined from the expression

     $$ \hat{\pi}_i = (1 + .5/n)\sin^2\hat{\gamma}_i - .25/n. \tag{6} $$

For a detailed discussion of this estimate, the reader is referred
to Novick and Jackson (1974, p. 352) and Novick, Lewis, & Jackson (1973).
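
The steps leading to the Jackson estimate can be sketched in code as
follows. This is a minimal illustration under the formulas stated above:
the Freeman-Tukey transformation, the known variance $(4n+2)^{-1}$, the
regression of each transformed score toward the group mean, and the
back-transformation to the proportion metric. The function names and the
scores are hypothetical.

```python
import math


def freeman_tukey(x, n):
    """Arcsine (Freeman-Tukey) transformation of a score x on n items."""
    return 0.5 * (
        math.asin(math.sqrt(x / (n + 1)))
        + math.asin(math.sqrt((x + 1) / (n + 1)))
    )


def back_transform(gamma, n):
    """Convert a value on the gamma metric back to the proportion metric."""
    return (1 + 0.5 / n) * math.sin(gamma) ** 2 - 0.25 / n


def jackson_estimates(scores, n):
    """Return (gamma estimates, pi estimates) for a group of examinees."""
    g = [freeman_tukey(x, n) for x in scores]
    n_examinees = len(g)
    g_bar = sum(g) / n_examinees
    v = 1.0 / (4 * n + 2)  # known sampling variance on the g metric
    phi_hat = sum((gi - g_bar) ** 2 for gi in g) / (n_examinees - 1) - v
    if phi_hat <= 0:
        # The variance estimate can come out negative; no usable solution.
        raise ValueError("estimated variance of the gammas is not positive")
    gamma_hat = [(phi_hat * gi + v * g_bar) / (phi_hat + v) for gi in g]
    pi_hat = [back_transform(y, n) for y in gamma_hat]
    return gamma_hat, pi_hat


# Hypothetical scores of six examinees on a 10-item test.
gamma_hat, pi_hat = jackson_estimates([2, 5, 8, 9, 4, 7], 10)
```

Each estimate lies between the examinee's own transformed score and the
group mean, and the pull toward the group mean is stronger the shorter the
test. The explicit error raised for a non-positive variance estimate is a
design choice of this sketch.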

(c) Bayesian Model II Estimate 

The Jackson estimate given above is not ideal since it does not
take into account any prior information that may be available. In
addition, it may happen that $\hat{\phi}$ estimated using (5) is negative, in
which case the solution will not be meaningful. Novick et al. (1973),
utilizing the transformations (1) and (2), obtained a Bayesian
solution for the estimation of the mastery score that not only takes into
account the direct and collateral information, but also any prior
information that may be available. In addition, this procedure avoids
the problem of negative estimates for $\phi$.

Since the distribution of $g_i$ has known variance but unknown mean
$\gamma_i$, the distribution of $g_i$ is customarily expressed as a conditional
distribution, i.e.,

     $$ g_i \mid \gamma_i \sim N(\gamma_i, v) \tag{7} $$

where $N(\gamma_i, v)$ represents the normal distribution with mean $\gamma_i$ and
variance v. The Bayesian estimates are based on the revised belief
about the parameters after the data are obtained. This revised belief
is summarized in the form of the posterior distribution of the parameters.

As a consequence of Bayes' Theorem, the posterior joint
distribution $h(\gamma_1, \gamma_2, \ldots, \gamma_N \mid \text{Data})$ is readily expressed in terms of the
prior distribution $f(\gamma_1, \gamma_2, \ldots, \gamma_N)$ as

     $$ h(\gamma_1, \gamma_2, \ldots, \gamma_N \mid \text{Data}) \propto g(\text{Data} \mid \gamma_1, \gamma_2, \ldots, \gamma_N)\, f(\gamma_1, \gamma_2, \ldots, \gamma_N). \tag{8} $$

The expression $g(\text{Data} \mid \gamma_1, \gamma_2, \ldots, \gamma_N)$ is known as the likelihood
function and is a statement of the joint probability of observing the data
conditional upon the unknown parameters $\gamma_1, \gamma_2, \ldots, \gamma_N$. The product of
the N distributions given by equation (7), where N is the number of
examinees in the sample, yields the likelihood function.

In order to obtain the posterior distribution of $\gamma_i$, it is
necessary to specify the prior knowledge about the distribution of $\gamma_i$,
or $f(\gamma_1, \gamma_2, \ldots, \gamma_N)$. In order to do this, it is assumed that the
transformed "true" scores $\gamma_1, \gamma_2, \ldots, \gamma_N$ of the N individuals are
exchangeable. This amounts to saying that the prior belief about one $\gamma_i$ is no
different from the belief about any other $\gamma_j$ and implies the assumption
that $\gamma_i$ is a random sample from some distribution. In particular, it is
assumed that the prior distribution of $\gamma_i$ is normal with unknown mean $\alpha$
and unknown variance $\phi$. Thus, the specification of the prior
distribution of $\gamma_i$ is dependent upon the knowledge of the mean $\alpha$ and the
variance $\phi$. However, Novick et al. (1973) have suggested that the prior belief
about $\alpha$ may not be as important as the specification of the prior belief
about $\phi$ and may be represented by a uniform distribution. The above
authors have further assumed that it is reasonable to represent the
belief about $\phi$ by an inverse chi-square distribution with $\nu$ degrees
of freedom and scale parameter $\lambda$ (see Novick and Jackson, 1974, for
an extensive discussion of this distribution). Specification of the
prior belief about $\phi$ thus requires the specification of only the two
parameters, $\nu$ and $\lambda$.

Novick et al. (1973) have considered in detail the problem of
setting values of the parameters, $\nu$ and $\lambda$. Based on various
considerations, these authors recommend setting $\nu = 8$. The mean, $\phi$, of the
inverse chi-square distribution is given by $\lambda / (\nu - 2)$, and once $\nu$ is
known, $\lambda$ can be set equal to $(\nu - 2)\phi$. To estimate $\phi$ it is necessary
to indicate the amount of information that is available about $\pi$. This
is accomplished by specifying a value M, where M is considered to be
the $\pi$ value of the typical examinee in the sample. The next step is
to specify the number of test items, t, that would have to be
administered to the examinee in order to obtain as much information
about $\pi$ as is deemed to be available. Now, transformed estimates of
$\pi$ from a t-item test are distributed normally on the $\gamma$-metric with
variance $(4t + 2)^{-1}$. Hence, $(4t + 2)^{-1}$ can be taken as an estimate
of $\phi$ and subsequently $\lambda$ can be specified.
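
As a small worked illustration of this last step, the scale parameter can
be computed from the chosen degrees of freedom and the number of items, t,
judged equivalent to the prior information. The function name is
hypothetical, and the value t = 10 is an arbitrary example rather than a
recommendation.

```python
def prior_scale(t, nu=8):
    """Return the scale parameter lambda = (nu - 2) * (4t + 2)^(-1).

    t is the length of the hypothetical test judged to carry as much
    information as the prior belief; nu is the degrees of freedom of the
    inverse chi-square distribution.
    """
    if nu <= 2:
        raise ValueError("the mean lambda / (nu - 2) requires nu > 2")
    phi_estimate = 1.0 / (4 * t + 2)  # variance on the gamma metric
    return (nu - 2) * phi_estimate


# Prior information judged equivalent to a 10-item test, with nu = 8.
lam = prior_scale(10)
```

With t = 10 and $\nu$ = 8, the estimate of $\phi$ is 1/42 and $\lambda$ is 6/42.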

Specification of $\nu$ and $\lambda$ in essence determines the prior
distribution $f(\gamma)$ of $\gamma_1, \ldots, \gamma_N$. Substituting this in equation (8),
Novick et al. (1973) obtained the joint posterior distribution of the
parameters, and hence the joint modal estimate of $\gamma_i$.

The joint modal estimate of $\gamma_i$ is obtained by solving the equation

     [equation not legible in the source]     (9)

where

     [equation not legible in the source]     (10)

This equation for $\gamma_i$ has to be solved iteratively, and has been found
(Novick, et al., 1973) to yield a satisfactory solution after only a
few iterations.

(d) Marginal Mean Estimate 

The Bayesian model II estimate discussed above is useful for
making joint decisions about a set of N examinees. However, in
criterion-referenced testing situations, separate decisions about each
individual have to be made and hence separate or marginal estimates
of true mastery or domain scores are required.

Lewis, Wang, and Novick (1973) have obtained a marginal mean
estimate of the true mastery score, given by

     $$ \hat{\gamma}_i = g_{\cdot} + \rho^{*}(g_i - g_{\cdot}). \tag{11} $$

The quantity $\rho^{*}$ is dependent on the parameters $\nu$ and $\lambda$ and on the
data; once the parameters are set, $\rho^{*}$ can be read directly from
tables prepared by Wang (1973). Again, once $\hat{\gamma}_i$ is obtained, $\hat{\pi}_i$ is
determined using equation (6).

(e) "Quasi" Bayesian Estimates

In obtaining the joint modal estimates and the marginal mean
estimates, Novick, et al. (1973) and Lewis, et al. (1973) assumed
that the prior beliefs about $\alpha$ and $\phi$ could be expressed in the form
of distributions. There are several variations to this theme. If
instead of specifying the prior beliefs in the form of distributions,
values for $\alpha$ and $\phi$ can be specified on the basis of previous
experience, then the expressions corresponding to the Bayesian marginal
mean estimates are readily obtained, and these estimates are
relatively easy to compute.

These estimates are based on the prior specification of $\alpha$ and $\phi$.
Specification of $\alpha$ introduces relatively few complications, but the
exact specification of $\phi$ poses a problem. This is not a quantity
most practitioners are familiar with. However, the interrogation
procedure described by Novick and Jackson (1974) can be effectively used
to yield this information. These quasi-Bayesian estimates are derived on
the assumptions that, 1. the prior belief about $\alpha$ can be expressed
as a uniform distribution, and $\phi$ can be specified exactly, and,
2. both $\alpha$ and $\phi$ can be specified exactly. In the first case, it
can be shown that the marginal mean estimate is given by

     $$ \hat{\gamma}_i = \frac{\phi\, g_i + (4n+2)^{-1} g_{\cdot}}{\phi + (4n+2)^{-1}}. \tag{12a} $$

In the second case, the marginal mean estimate, $\hat{\gamma}_i$, becomes

     $$ \hat{\gamma}_i = \frac{\phi\, g_i + (4n+2)^{-1} \alpha}{\phi + (4n+2)^{-1}}. \tag{12b} $$

The similarity between the marginal mean estimates (12a) and (12b)
and the Jackson estimate (3) is obvious. In fact, it is interesting
to note that the Jackson estimate is in reality an empirical Bayes
estimate and a version of it has been given by Rao (1965).
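
Both quasi-Bayesian estimates have the same weighted-average form and
differ only in the point toward which the transformed score is regressed:
the group mean in (12a) and the specified prior mean in (12b). A minimal
sketch, with a hypothetical function name and made-up values, is:

```python
def quasi_bayes_estimate(g_i, n, phi, center):
    """Weighted average of a transformed score and a centering value.

    g_i    : the examinee's transformed (arcsine) score
    n      : number of test items
    phi    : specified variance of the transformed true scores
    center : the group mean of the g's for (12a), or the specified
             prior mean for (12b)
    """
    v = 1.0 / (4 * n + 2)  # sampling variance on the g metric
    return (phi * g_i + v * center) / (phi + v)


# Made-up values: a 10-item test and phi set equal to (4n + 2)^(-1),
# so that the score and the centering value receive equal weight.
estimate = quasi_bayes_estimate(1.0, 10, 1.0 / 42.0, 0.5)
```

In this example the estimate falls halfway between the transformed score
(1.0) and the centering value (0.5); a larger specified $\phi$ would move the
estimate closer to the examinee's own score.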

Allocation of Examinees to Mastery States 

Let us consider now the situation where one is interested in
assigning an examinee to one of several mastery states or categories.
In view of the discussion in the last section, it may appear tempting
to first estimate the examinee's domain score or mastery score,
compare it with the cut-off scores, and then, in the case of two
categories, classify the examinee as either a master or a non-master.
Unfortunately, this approach is not very satisfactory. The estimates
for the domain scores may be based on a loss function completely
inappropriate for that associated with making decisions. For instance,
the joint modal estimate and the marginal mean estimates are based
on a zero-one loss function and a squared-error loss function,
respectively. In making decisions, how far the examinee is from, say, the
cut-off score is of no concern. Instead, the main concern is whether
the examinee is above or below the cutting-score. Hence, an
appropriate loss function in the decision-theoretic process is the
threshold loss function. This, together with the losses or costs associated
with misclassifications, makes obvious the fact that in order to
classify students into categories, a decision-theoretic procedure
has to be used.

We shall first consider the problem of classifying an examinee
into one of two categories. As in the previous section, the observed
scores $x_i$ are transformed into $g_i$ by the arc sine transformation.
Let $\gamma$ ($= \sin^{-1}\sqrt{\pi}$) denote the transformed domain score $\pi$, and
let $\pi_o$ be the cut-off score. If $\gamma_o$ ($= \sin^{-1}\sqrt{\pi_o}$) is the
transformed cut-off score, examinees with true scores $\gamma$ less than $\gamma_o$ are
classified as true non-masters, and true masters otherwise. Conforming
with the notation employed by Hambleton and Novick (1973) we define the
two-valued parameter $\omega$ to denote the mastery state of the examinee. The
parameter $\omega$ assumes one of two values, $\omega_1$ or $\omega_2$. If the examinee is a
non-master, i.e., if $\gamma < \gamma_o$, we set

     $$ \omega = \omega_1, $$

and if he is a master, i.e., $\gamma \ge \gamma_o$, we set

     $$ \omega = \omega_2. $$

Both $\gamma$ and $\omega$ are, of course, unobservable quantities. Our
approach is to produce, using Bayesian statistical methods, the
posterior distribution representing our belief about the location of the
parameter $\gamma$. Using this distribution and with a cutting score defined,
we can produce probabilities representing the chances of an examinee
being located in each mastery state.

In classifying an examinee the decision-maker may take one
of two actions - retain the examinee for instruction or advance the
examinee to the next segment of instruction. The action "retain"
will be denoted by $a_1$ and the action "advance" by $a_2$. The
decision-maker can commit one of two kinds of errors. If the individual is
in reality a non-master (in state $\omega_1$), the decision-maker can
classify the individual as a master (in state $\omega_2$). If in reality the
individual is a master (in state $\omega_2$), the decision-maker can classify
the individual as a non-master (in state $\omega_1$). In order to arrive at
a rule for selecting actions $a_1$ or $a_2$, it is necessary to specify the
losses associated with these two kinds of misclassifications.

Conforming with the usage and notation of decision theory, we
shall employ the notation $L(\omega_i, a_j)$ to denote the non-negative loss
function which describes the loss incurred when action $a_j$ is taken
for the individual who is in state $\omega_i$. Thus,

     $$ L(\omega_1, a_2) = l_{12} $$

and

     $$ L(\omega_2, a_1) = l_{21} $$

with

     $$ L(\omega_1, a_1) = L(\omega_2, a_2) = 0. $$

A good classification procedure is obviously one which minimizes
in some sense or other the total loss incurred. That is, we shall
choose that action for which the expected loss

     $$ E\, L(\omega, a) $$

is a minimum.

We see that if action $a_1$ is taken, then the expected loss,
$E\, L(\omega, a_1)$, is given by

     $$ E\, L(\omega, a_1) = 0 \cdot \text{Prob}[\omega = \omega_1] + l_{21} \cdot \text{Prob}[\omega = \omega_2] = l_{21}\, \text{Prob}[\gamma \ge \gamma_o]. \tag{13} $$

Similarly, if action $a_2$ is taken, then the expected loss, $E\, L(\omega, a_2)$,
is given by

     $$ E\, L(\omega, a_2) = l_{12} \cdot \text{Prob}[\omega = \omega_1] + 0 \cdot \text{Prob}[\omega = \omega_2] = l_{12}\, \text{Prob}[\gamma < \gamma_o]. \tag{14} $$

We take action $a_1$ if

     $$ E\, L(\omega, a_1) < E\, L(\omega, a_2), $$

or equivalently, if

     $$ l_{21}\, \text{Prob}[\gamma \ge \gamma_o] < l_{12}\, \text{Prob}[\gamma < \gamma_o]. \tag{15} $$

Similarly, we take action $a_2$ if

     $$ l_{12}\, \text{Prob}[\gamma < \gamma_o] < l_{21}\, \text{Prob}[\gamma \ge \gamma_o]. \tag{16} $$

If it so happened that

     $$ l_{12}\, \text{Prob}[\gamma < \gamma_o] = l_{21}\, \text{Prob}[\gamma \ge \gamma_o], $$

we would be indifferent as to which action to take.
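
The two-category rule can be sketched as a comparison of two expected
losses. In the sketch, the probability that the examinee is a master is
taken as given (it would come from the posterior distribution); the
function name, the argument names, and the numerical values are
hypothetical.

```python
def choose_action(prob_master, loss_false_advance, loss_false_retain):
    """Choose between retaining and advancing an examinee.

    prob_master        : probability that the examinee is a true master
    loss_false_advance : loss for advancing a true non-master
    loss_false_retain  : loss for retaining a true master
    """
    # Expected loss of retaining: incurred only if the examinee is a master.
    expected_retain = loss_false_retain * prob_master
    # Expected loss of advancing: incurred only if the examinee is a non-master.
    expected_advance = loss_false_advance * (1 - prob_master)
    if expected_retain < expected_advance:
        return "retain"
    if expected_advance < expected_retain:
        return "advance"
    return "indifferent"


# Advancing a non-master is taken to be twice as costly as retaining a master.
decision = choose_action(0.6, loss_false_advance=2, loss_false_retain=1)
```

With these made-up losses, an examinee whose probability of mastery is .60
is retained, because the expected loss of advancing (.80) exceeds that of
retaining (.60); with equal losses the same examinee would be advanced.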

Swaminathan, Hambleton, and Algina (1975) generalized this
two-category problem to one where examinees are classified into one of
several categories. Suppose that there are k categories into which the
examinees are to be classified and consequently k actions to be taken.
For example, when k = 3, the decision-maker may be interested in
classifying examinees as masters, partial masters, or non-masters. The
appropriate actions may be to advance the masters, retain the partial
masters for a brief review, and retain the non-masters for remedial work.

In order to separate examinees into k categories or k states,
$\omega_1, \omega_2, \ldots, \omega_k$, we need k-1 cut-off scores. Denote these by $\pi_{o1}$,
$\pi_{o2}, \ldots, \pi_{ok-1}$. Thus an examinee is in state $\omega_1$ if his true
proportion score $\pi$ is less than $\pi_{o1}$, in state $\omega_2$ if his score $\pi$ is
between $\pi_{o1}$ and $\pi_{o2}$, and so on. In general an examinee is in state
$\omega_i$ if $\pi_{oi-1} \le \pi < \pi_{oi}$. In addition, denote the set of k actions
to be $a_1, \ldots, a_j, \ldots, a_k$. Action $a_j$ is to be taken if the
examinee is classified into state $\omega_j$.

Associated with misclassifications is the loss function $L(\omega_i, a_j)$.
If an action $a_j$ is taken for an individual who in reality is in state
$\omega_i$, the loss is $l_{ij}$, so that

     $$ L(\omega_i, a_j) = l_{ij}. $$

These losses are conveniently displayed in Table 1. As before, we
choose the action which has the smallest expected loss. Here again
we utilize the transformation presented in equation (1).
For action $a_j$, the expected loss is given by

     $$ E_{\omega} L(\omega, a_j) = \sum_{p=1}^{k} l_{pj}\, \text{Prob}[\gamma_{op-1} \le \gamma < \gamma_{op}] \tag{17} $$

where $\gamma_{oo} = -\infty$ and $\gamma_{ok} = +\infty$. Thus action $a_j$ is chosen if,
for every other action $a_{j'}$,

     $$ \sum_{p=1}^{k} l_{pj}\, \text{Prob}[\gamma_{op-1} \le \gamma < \gamma_{op}] \le \sum_{p=1}^{k} l_{pj'}\, \text{Prob}[\gamma_{op-1} \le \gamma < \gamma_{op}]. \tag{18} $$

The probabilities given in Equations (13) through (18) are
really posterior probabilities and should be so stated. Thus,

     $$ \text{Prob}[\gamma_{op-1} \le \gamma < \gamma_{op}] $$

in Equation (18) should be written as

     $$ \text{Prob}[\gamma_{op-1} \le \gamma < \gamma_{op} \mid \text{Data}]. $$

Once the posterior distribution of $\gamma$ is determined, the above
probability is determined as the area under the probability density
between $\gamma_{op-1}$ and $\gamma_{op}$.

The next stage in the decision-theoretic process is to obtain this
posterior distribution of the parameter, $\gamma$, for each individual, or, the
posterior marginal distribution. The posterior joint distribution of
the parameters, given the prior and the likelihood function, is
obtained by using Equation (8) given previously. Once the joint
distribution is obtained, the marginal distribution is obtained by
integrating out all the irrelevant parameters.

Several procedures are available for the determination of
posterior marginal distributions and, hence, posterior marginal
probabilities. The first method is that given by Lewis et al. (1973).
Utilizing the distributions and assumptions given in connection with
the Bayesian model II estimates in a previous section, Lewis et al.
(1973) derived an approximation to the posterior marginal
distribution. They showed that the posterior marginal distribution of $\gamma_i$
is approximately normal, i.e.,

     $$ \gamma_i \mid \text{Data} \sim N(\mu_i, \sigma_i^2) \tag{19} $$

where

     $$ \mu_i = g_{\cdot} + \rho^{*}(g_i - g_{\cdot}) \tag{20} $$

and

     $$ \sigma_i^2 = (4n+2)^{-1}\, \frac{1 + (N-1)\rho^{*}}{N} + \sigma^{*2}(g_i - g_{\cdot})^2. \tag{21} $$

(This approximation is reasonably good when the number of test items
exceeds seven.) The quantity $g_{\cdot}$ is defined by Equation (4). The
quantities $\rho^{*}$ and $\sigma^{*2}$ in expressions (20) and (21) are dependent on
the parameters $\nu$ and $\lambda$ of the inverse chi-square distribution of $\phi$, and
have to be computed by numerical integration. As mentioned earlier,
the tables prepared by Wang (1973) can be used so that on specifying
$\nu$ and $\lambda$, $\rho^{*}$ and $\sigma^{*2}$ may be obtained.

Returning to the problem of classification of students into k mastery
categories, we first transform the (k-1) specified cut-off scores
$\pi_{op}$ into $\gamma_{op}$, given by

$$\gamma_{op} = \sin^{-1}\sqrt{\pi_{op}}, \quad p = 1, \ldots, k-1. \qquad (22)$$

The next step is to calculate the probabilities of the type given by
Equations (16), (17), and (18). It is clear that for any examinee,

$$\mathrm{Prob}[\pi_{o,p-1} \le \pi < \pi_{op} \mid \text{Data}] = \mathrm{Prob}[\gamma_{o,p-1} \le \gamma < \gamma_{op} \mid \text{Data}]. \qquad (23)$$



For the ith examinee, we define the quantity $z_{oji}$ as

$$z_{oji} = \frac{\gamma_{oj} - \mu_i}{\sigma_i}, \qquad (24)$$

with $\mu_i$ and $\sigma_i^2$ defined by Equations (20) and (21). The
quantity $z_{oji}$ is merely the normal deviate corresponding to the
cut-off score j for examinee i. Since the posterior distribution is
approximately normal with mean $\mu_i$ and variance $\sigma_i^2$,

$$\mathrm{Prob}[\gamma_{o,p-1} \le \gamma_i < \gamma_{op} \mid \text{Data}] \approx \mathrm{Prob}[z_{o,p-1,i} \le z_i < z_{opi} \mid \text{Data}]. \qquad (25)$$
That is, the probability that $\gamma_i$ is between $\gamma_{o,p-1}$ and
$\gamma_{op}$ is approximately equal to the probability that a
standardized normal variate is between the z scores $z_{o,p-1,i}$ and
$z_{opi}$. Hence, for each examinee i, the quantity

$$E_\omega L(\omega, a_j) = \sum_{p=1}^{k} l_{pj}\, \mathrm{Prob}[z_{o,p-1,i} \le z_i < z_{opi} \mid \text{Data}] \qquad (26)$$

is calculated for each action j (j = 1, 2, ..., k). These k expected
losses are then compared with one another, and the action for which
the expected loss is the least is chosen as the appropriate action.
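
The steps in Equations (22) through (26) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' program: it takes the posterior mean and standard deviation of $\gamma_i$ as given, and it assumes that the lowest category collects all posterior mass below the first cut-off and the highest category all mass above the last. The function names and the layout of the loss matrix (`loss[p][j]` is the loss of taking action j when the true state is p) are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def classify(mu_i, sigma_i, cutoffs, loss):
    """Return (state probabilities, expected losses, index of best action).

    mu_i, sigma_i -- posterior mean and standard deviation of gamma_i
    cutoffs       -- the k-1 cut-off scores on the proportion scale
    loss          -- k x k matrix, loss[p][j] for true state p and action j
    """
    gammas = [math.asin(math.sqrt(c)) for c in cutoffs]  # Equation (22)
    z = [(gam - mu_i) / sigma_i for gam in gammas]       # Equation (24)
    # Assumed boundary convention: open-ended lowest and highest categories.
    cdf = [0.0] + [normal_cdf(zj) for zj in z] + [1.0]
    probs = [cdf[p + 1] - cdf[p] for p in range(len(cdf) - 1)]  # Equation (25)
    k = len(probs)
    expected = [sum(loss[p][j] * probs[p] for p in range(k))   # Equation (26)
                for j in range(k)]
    best = min(range(k), key=lambda j: expected[j])
    return probs, expected, best
```

With cut-offs of .6 and .8 and any loss matrix whose diagonal is zero, an examinee whose posterior is concentrated well above $\sin^{-1}\sqrt{.8}$ is assigned to the highest category, since the other two actions carry nearly the full off-diagonal loss.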

In order to illustrate the procedure consider the following 
hypothetical example. The data and results for this example are 
summarized in Tables 2 and 3. 

Suppose that a set of 10 items representative of the domain of
items measuring an objective is administered to a group of 25
examinees, and that the examinees are to be classified into one of three
categories: masters, partial masters, and non-masters. The losses
associated with wrongly classifying the examinee are given in Table 4.
Also, assume that the cut-off scores $\pi_{o1}$ and $\pi_{o2}$ are .6 and
.8, respectively. First, the observed scores $x_i$ are transformed into
$g_i$, and the cut-off scores $\pi_{o1}$ and $\pi_{o2}$ into
$\gamma_{o1}$ and $\gamma_{o2}$. Next, the prior belief about $\phi$ is
specified. As indicated earlier, this is done by choosing $\nu$ and
$\lambda$, the parameters of the distribution that is used to represent
the belief about $\phi$. In order to determine $\nu$ and $\lambda$, the
length of the test that would be required to yield as much information
as one feels one has about any examinee's true mastery score $\pi_i$ is
decided. Suppose that it is decided that a five-item test would be
required. Hence, $t = 5$ and $(4t+2)^{-1} = .0454$ is the value for
$\hat{\phi}$. Since, in general, a good value for $\nu$ is eight, the
value for $\lambda$ is .2727 [$= (\nu - 2)\hat{\phi}$]. The tables
prepared by Wang (1973) give $\rho^* = .5335$ and $\sigma^{*2} = .0159$.
The next step is to compute $\mu_i$ and $\sigma_i^2$ using Equations (20)
and (21). Finally, the standardized normal deviate given by Equation (24)
is obtained and, using the tables of the standardized normal
distribution, the approximate probabilities
$\mathrm{Prob}[\pi_i < .6 \mid \text{Data}]$,
$\mathrm{Prob}[.6 \le \pi_i < .8 \mid \text{Data}]$, and
$\mathrm{Prob}[\pi_i \ge .8 \mid \text{Data}]$ are calculated.

[Table 4: Losses for the Three-Action Problem]
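
The two prior constants quoted in the example follow directly from the chosen equivalent test length and the suggested value of $\nu$. The short worked computation below assumes only the relations stated in the text, namely $\hat{\phi} = (4t+2)^{-1}$ and $\lambda = (\nu-2)\hat{\phi}$:

```latex
\hat{\phi} = (4t + 2)^{-1} = \frac{1}{4(5) + 2} = \frac{1}{22} \approx .0455,
\qquad
\lambda = (\nu - 2)\,\hat{\phi} = \frac{8 - 2}{22} = \frac{6}{22} \approx .2727 .
```

The value .0454 reported in the example is the same quantity truncated rather than rounded at the fourth decimal place.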

The hypothetical probabilities reported in Table 3 are the
probabilities associated with an examinee being in any one of these
three categories. These probabilities, when combined with the loss
structure presented in Table 4, would result in examinees with
seven or eight correct items being retained for a brief review and
examinees with a score of nine or ten items correct being moved
ahead.

The Bayesian method outlined above is one of several methods
that could be used to provide the posterior probabilities necessary
for the decision-theoretic approach. Other methods that could be
used to produce the posterior probabilities can be developed along
the lines indicated in the previous section. One obvious procedure
is to obtain the posterior probabilities under the assumption that,
instead of specifying the prior beliefs about $\alpha$ and $\phi$ (the
parameters that characterize the distribution of $\gamma_i$) in the form
of a distribution, values for $\alpha$ and $\phi$ can be specified
exactly. In this case, the posterior marginal distribution of $\gamma_i$
is normal with mean

$$\frac{\alpha v + g_i \phi}{\phi + v}$$

and variance

$$\frac{v\phi}{\phi + v},$$

i.e.,

$$\gamma_i \mid \alpha, \phi, \text{Data} \sim N\!\left(\frac{\alpha v + g_i \phi}{\phi + v},\; \frac{v\phi}{\phi + v}\right). \qquad (27)$$


Once the posterior marginal mean and variance are obtained, the
cut-off scores are transformed and the posterior probabilities
obtained for each examinee. The expected loss for each action is
obtained as given by Equation (26) and the appropriate decisions made.

Another method for obtaining the posterior probabilities is to
assume that the variance $\phi$ of the distribution of $\gamma_i$ is
specified exactly but that the distribution of $\alpha$ is uniform. This
amounts to saying that although we have prior beliefs about $\phi$, we
are ignorant about $\alpha$. In this case, the posterior marginal
distribution of $\gamma_i$ is also normal, and is given by

$$\gamma_i \mid \phi, \text{Data} \sim N\!\left(\frac{\bar{g}\, v + g_i \phi}{\phi + v},\; \frac{v(\phi + vN^{-1})}{\phi + v}\right). \qquad (28)$$


Again, the posterior probabilities are obtained in the manner described 
above, and the appropriate decisions made. 
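
The two simpler posteriors can be computed directly. The sketch below is illustrative only: it assumes that $v$ denotes the sampling variance of $g_i$, that Equation (27) has mean $(\alpha v + g_i\phi)/(\phi+v)$ and variance $v\phi/(\phi+v)$, and that Equation (28) replaces $\alpha$ by the group mean $\bar{g}$ and has variance $v(\phi + v/N)/(\phi+v)$. The function names are hypothetical.

```python
def posterior_known_prior(g_i, v, alpha, phi):
    """Posterior mean and variance of gamma_i when alpha and phi are
    specified exactly (assumed reading of Equation (27))."""
    mean = (alpha * v + g_i * phi) / (phi + v)
    var = v * phi / (phi + v)
    return mean, var


def posterior_uniform_mean(g_i, g_bar, v, phi, n_examinees):
    """Posterior mean and variance of gamma_i when phi is specified exactly
    and alpha has a uniform prior (assumed reading of Equation (28))."""
    mean = (g_bar * v + g_i * phi) / (phi + v)
    var = v * (phi + v / n_examinees) / (phi + v)
    return mean, var
```

Under these assumptions the uniform-prior variance always exceeds the known-prior variance, and the difference shrinks as the number of examinees grows, which reflects the extra uncertainty from having to estimate $\alpha$ from the group.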

The posterior marginal distribution can be obtained more directly
if, instead of transforming the observed score $x_i$ into $g_i$ by the
arcsine transformation, we work directly with the proportions. In this
case, the Beta-binomial analysis outlined by Novick and Jackson (1974)
and Novick and Lewis (1974) can be utilized effectively to produce the
posterior probabilities. For details of this procedure, we refer the
reader to the above references.
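
To give a feel for the Beta-binomial route, the following minimal sketch computes the posterior probabilities of the mastery states for one examinee. It is not the procedure of the cited references: it assumes a Beta prior with positive integer parameters (a uniform Beta(1, 1) prior by default), so that the posterior is Beta with integer parameters and its distribution function can be written as a finite binomial sum. All names are hypothetical.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """Distribution function of a Beta(a, b) variable at x for positive
    integer a and b, using the identity I_x(a, b) = P(Bin(a+b-1, x) >= a)."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))


def mastery_probabilities(correct, n_items, cutoffs, a0=1, b0=1):
    """Posterior probability of each mastery state given the number correct.

    a0, b0 are the (integer) Beta prior parameters; the default uniform
    prior is an assumption made for illustration.
    """
    a = a0 + correct
    b = b0 + n_items - correct
    cdf = [0.0] + [beta_cdf_int(c, a, b) for c in cutoffs] + [1.0]
    return [cdf[p + 1] - cdf[p] for p in range(len(cdf) - 1)]
```

These probabilities can be combined with a loss matrix in the same way as the normal-approximation probabilities in Equation (26).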

It should be pointed out that more recently Lewis, Wang, and Novick
(1974) have developed an extension of the procedure for deriving
the posterior marginal distribution by incorporating the prior
information on the parameter $\alpha$. They assumed, in addition to all
the assumptions made for obtaining the joint modal and marginal mean
estimates, that

$$\alpha \sim N(\mu, \phi/\eta). \qquad (29)$$

The quantity $\eta$ together with $\mu$ and the parameters $\lambda$ and
$\nu$ for specifying the distribution of $\phi$ have to be supplied by the
user. This procedure shows great promise and needs to be studied
carefully.

Application of a Bayesian Decision-Theoretic Procedure 

The procedures described in the previous section should be feasible
with objectives-based programs that have a small computer of the type
typically used to manage instruction (see, for example, Baker, 1971). We
shall attempt to demonstrate the feasibility of the procedure by briefly
outlining the steps a hypothetical instructional designer would take.
Let us suppose that an instructional designer is interested in making
decisions on students' status with respect to a particular set of
program objectives. Test items measuring each objective are organized
into a criterion-referenced test and administered to the students.
We assume that the test items are binary scored and represent
a random sample of items from the domain of items that measure each
objective. For each objective, the designer must specify the number
and the location of the mastery states on the mastery score interval
[0, 1]. This is done by defining the cutting scores. In addition,
the instructional designer specifies the losses attached to classifying
an individual incorrectly. A loss matrix of the kind shown in Table 1
is developed and provided to the computer. Some rough guidelines for
developing the loss matrix have been described by Hambleton and Novick
(1973). Finally, it is necessary for the designer to specify his prior
beliefs about the distribution of ability on each objective covered in
the test. This is one step where the instructional designer needs
to be extremely careful. The effects of a poor choice of priors on the
decision process are not known at this point, and it remains to be
determined under what conditions a poor choice of priors will result in
worse decisions than not using Bayesian methods at all. Clearly,
further research is necessary to develop efficient methods for accurately
assessing prior beliefs.

Using any one of a variety of input devices (e.g., optical scanning
sheets, mark sense cards, or computer cards) the examinee test
item responses are read by the computer and the Bayesian
decision-theoretic procedure implemented. The computer program can be
designed to provide the output necessary to monitor student progress
through the instructional program. A statement of domain scores and
mastery allocations on objectives for each student can be produced and
this information can be used to guide a student through the next segment
of his instruction.
The decision-theoretic procedure outlined in the last section
provides a framework within which Bayesian statistical methods can be
employed with criterion-referenced tests to improve the quality of
decision-making in objectives-based instructional programs. The
incorporation of losses introduces the decision-maker's values into the
decision process. The Bayesian methods incorporate the prior knowledge of
the decision maker and utilize the data from all examinees, thereby
effectively increasing the amount of information the decision maker has
without requiring the administration of additional test items. However,
it should be pointed out that research is needed to establish
the robustness of the Bayesian statistical model with respect to
deviations of the data from the underlying assumptions. We also note that
the Bayesian statistical model described in this monograph is only one of
several models that could be used (for example, see Novick and Lewis,
1974, for another) within our decision-theoretic framework. Further
study of these additional models would seem to be highly appropriate.
Selected Psychometric Issues

Of fairly obvious concern for both the theory and practice of
criterion-referenced measurement are the following issues: (1) concepts 
of error of measurement, (2) reliability, (3) determination of appro- 
priate test length, and (4) determination of cut-off scores. This section 
is intended to provide both a review and discussion of the literature con- 
cerning each of these issues. 

Concepts of Error of Measurement for Criterion-Referenced Tests 

A framework for discussing errors of measurement of criterion-referenced
tests would need to include at least three dimensions. The first has to do
with the use of the test: estimation of domain score or allocation to mastery
states; errors have to be defined differently for these two uses of the
test. The second dimension is concerned with the particular view of
probability that one adopts. If the view of subjective probability is adopted,
the concept of error of measurement is related to the properties of the
posterior distribution for the true score that is being estimated. If the
frequency view of probability is adopted, then the concept of error of
measurement is related to the observed score distribution for the examinee.
The final dimension concerns whether information about the error is desired
for the individual, the group, or both. However, the discussion of
measurement error will be principally in terms of the first dimension, although
the latter two dimensions will be briefly referred to.

Earlier in the monograph we identified two uses of criterion-referenced
tests. In this section we shall first discuss the concept of error associated
with estimating the examinee's domain score. Many theorists in
criterion-referenced measurement have insisted that the items on a
criterion-referenced test should be interpretable as a random sample from
some domain of items that may be described with a high degree of specificity.
They argue that when this situation obtains, the observed proportion
correct score may be considered to be an unbiased estimate of the
domain score. The situation in which tests are constructed by random
sampling from a domain of items is clearly one example of the class
of situations for which generalizability theory was intended (Cronbach,
Rajaratnam, & Gleser, 1963; Cronbach, Gleser, Nanda, & Rajaratnam, 1972).
The brief treatment of generalizability theory given in chapter eight
of Lord and Novick (1968), which is concerned with nominally (or
randomly) parallel tests, is sufficient for our limited aims in this
monograph.

Lord and Novick (1968) discuss the notion of generic true score,
which we shall use to define the domain score, $\pi_a$, i.e.,

$$\pi_a = E_j Y_{ja}, \qquad (30)$$

where $Y_{ja}$ is a random variable for examinee a defined over tests
constructed by random sampling of items and $E$ is the expectation
operator. The generic error of measurement is

$$\epsilon_{ja} = Y_{ja} - \pi_a, \qquad (31)$$

which is the deviation of the observed score for examinee a on test j
from his generic true score. The generic error of measurement is the
quantity of interest when our purpose is to estimate the examinee's domain
score since it contains information about the accuracy of the domain score
estimates. Lord and Novick (1968) give the following linear model for
the observed score:

$$Y_{ja(k)} = \pi_a + (\tau_j - \mu) + \iota_{ja} + e_{ja(k)}, \qquad (32)$$

where $\tau_j$ is the mean of the jth test, $\iota_{ja}$ is the
interaction between person a and test j, and $e_{ja(k)}$ is the specific
error of measurement on the kth replication of the test. This model
implies the identity

$$\epsilon_{ja} = e_{ja(k)} + (\tau_j - \mu) + \iota_{ja}. \qquad (33)$$

From the definition of generic error and this identity, Lord and Novick
(1968) derive a number of interesting properties for $\epsilon_{ja}$. One
property is

$$E_j \epsilon_{ja} = 0, \qquad (34)$$

that is, over randomly sampled tests the expected value of the generic
error of measurement is zero and hence the observed score is an unbiased
estimate of the domain score. However, the expected value for any given
sample of items over replications is given by

$$E_k \epsilon_{ja} = E_k[e_{ja(k)} + (\tau_j - \mu) + \iota_{ja}] \qquad (35)$$

$$= \tau_j - \mu + \iota_{ja}. \qquad (36)$$

Thus, on any administration of test j for person a there is a bias due to
the test difficulty term, $(\tau_j - \mu)$, and the interaction term. It
is clear that estimating this bias should be one concern of the users of
criterion-referenced tests.
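
The step from the expectation over replications to the bias term can be spelled out. The short derivation below is a sketch under two assumptions that the text does not state explicitly: that the specific error has zero expectation over replications, and that the test difficulty and interaction terms are fixed for a given test and person and have zero expectation over randomly sampled tests.

```latex
\begin{aligned}
E_k\,\epsilon_{ja}
  &= E_k\,e_{ja(k)} + (\tau_j - \mu) + \iota_{ja}
     && \text{(the last two terms do not vary over replications)} \\
  &= \tau_j - \mu + \iota_{ja}
     && \text{(assuming } E_k\,e_{ja(k)} = 0\text{)}, \\[4pt]
E_j\,E_k\,\epsilon_{ja}
  &= E_j(\tau_j - \mu) + E_j\,\iota_{ja} = 0
     && \text{(assuming } E_j\,\tau_j = \mu,\; E_j\,\iota_{ja} = 0\text{)}.
\end{aligned}
```

The second line is the bias for a fixed test, and the last line shows how that bias averages out over randomly sampled tests, which is the unbiasedness property stated for the generic error.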

Other important properties of the generic error of measurement may
be enumerated. However, rather than listing these properties we refer
the reader to Lord and Novick (1968) and point out that the properties
of interest depend critically on whether the investigator is interested in
group or individual error distributions, and whether the error is defined
with respect to replications or randomly parallel tests.

Having defined and discussed to some extent the error of measurement,
the important consideration of a loss function arises next. A loss
function may be described as a function that weights the error incurred in
estimating a parameter, and in this case the loss function weights the
error of measurement incurred in estimating a domain score. If we
decide that the squared-error loss function provides a reasonable
quantification of the loss incurred by the error of measurement, the procedures
given in chapter eight of Lord and Novick (1968) will be useful to estimate
parameters concerned with the error of measurement.

The above discussion implicitly assumes that the frequency view of
probability is adopted. However, it is equally reasonable to consider
the "error of measurement" from a subjective view of probability. Within
the framework of subjective probability, philosophical considerations imply
that the concern should be with the quality of information we have about
the individual's true score rather than the "error of measurement." One
method of quantifying the quality of information is in terms of the limits
of the c percent highest density region of the posterior distribution of
the domain score. If we are satisfied with our knowledge that there is
a c percent probability that $\pi_a$ lies within these limits, then the
test is providing the information we desire. If the region is too
wide, a longer test is required, while if the region is narrower than
we require, a shorter test may be used.

In the previous section we introduced a linear model to point
out the possible bias in the estimation of an examinee's domain score.
To discuss the issue within the framework of subjective probability, we
need to investigate the Bayesian procedures for the analyses of such
linear models. The Bayesian models discussed earlier in the monograph
may not be appropriate for this purpose since a linear model such as
that given by Equation (32) may not be implied by the Bayesian models.
Therefore, we will not discuss the possibility of a bias in Bayesian
estimators due to an unrepresentative sample of items.

The second purpose of criterion-referenced testing is that of
classifying examinees into mutually exclusive categories or mastery states.
As outlined earlier, typically k-1 cut-off scores are specified to
separate the examinees into k categories. In the case of a single
cut-off score, the examinees with domain scores greater than the cut-off
score have mastered the instructional material to a desired level of
proficiency, while those with domain scores below the cut-off score have
not achieved the required level of proficiency. The problem is to use
the results of a criterion-referenced test to decide on which side of
the cut-off score each examinee's domain score lies.

There are at least two possible concepts for error of measurement
when the purpose is to classify individuals into mastery states. The
first concept is based on the accuracy of decisions while the second
concept is based on the consistency of decisions made on repeated
administrations of a criterion-referenced test. The concept of decision-making
accuracy implies that an error occurs whenever an individual is
incorrectly classified. A plausible loss function for this error of
measurement is the threshold loss function; however, Novick and Lewis (1974)
suggest three additional loss functions that might be used:

(1) A threshold loss function with an indifference region
    in which there is zero loss for false positive or false
    negative errors,

(2) A negative squared-exponential loss used with the root
    arcsine transformation parameter,
    $\gamma = \sin^{-1}\sqrt{\pi}$,

(3) A cumulative Beta distribution loss function.

From the concept of decision-making consistency it follows that
errors should be defined in terms of inconsistencies in allocation
of examinees to mastery states across repeated administrations of
a criterion-referenced test. An error occurs if an examinee is
classified in different mastery categories on different
administrations of a criterion-referenced test. Here again a threshold
loss function is a reasonable loss function. However, additional
loss functions should again be considered. In particular, the
threshold loss function with an indifference region may be useful.

It should be realized that the concept of error based on
decision-making consistency is very different from that based on
decision-making accuracy. Inconsistent classifications imply that a
misclassification has occurred on one of the classifications, but consistent
classifications do not necessarily imply that accurate decisions have been
made, for it is entirely possible to be consistently inaccurate. Inaccurate
but consistent decisions may occur whenever a Bayesian decision-theoretic
procedure is used for classification. The choice of loss ratio,
violations of the Bayesian model assumptions, improper specifications of
priors, and regression effects, acting either alone or in conjunction, can
create consistently inaccurate decisions. The possibility of consistently
inaccurate decisions also occurs when the sample proportion correct score
is used to make classificatory decisions. If we adopt the definition
of error of measurement given by Equation (31), then the covariance of
the generic errors of measurement over examinees on two tests will in
general be non-zero, even though the expected value of such covariances
over all pairs of tests in an infinite population of tests will be zero
(Lord & Novick, 1968). Since we have correlated errors, the possibility
exists that consistently inaccurate decisions may be made on the basis
of the observed proportion correct score.

Reliability of Criterion-Referenced Tests 

Lord and Novick (1968) point out that the standard error of
measurement provides meaningful information about the degree of inaccuracy
of a norm-referenced test only when we have knowledge of the observed score
variance for the group we are interested in. If we do not, the reliability
coefficient provides more meaningful information. This state of
affairs is a reflection of the relative interpretation of
norm-referenced test scores. However, properly constructed
criterion-referenced tests yield absolute interpretations and, when we are
estimating domain scores, a quantity such as the standard error of
measurement will always provide meaningful information about the degree of
inaccuracy of the test (Harris, 1972). Both the probability of
misclassification and the probability of inconsistent classification provide
needed information about the "reliability" of the test. There
have been several reliability indices proposed in the educational
measurement literature that are related to decision-making accuracy
and decision-making consistency, and some of these are discussed
below.

Suppose that we administer a criterion-referenced test to a
population of examinees on two occasions and classify the examinees into
one of k mutually exclusive mastery states at each administration, and
denote the proportion of examinees placed in the ith mastery state on
the first administration and in the jth mastery state on the second
administration by $p_{ij}$. An intuitively appealing measure of agreement
between the decisions made on the two administrations is

$$\sum_{i=1}^{k} p_{ii},$$

where $p_{ii}$ is the proportion of examinees placed in the ith mastery
state on both test administrations. However, as noted by Swaminathan,
Hambleton, and Algina (1974), this measure of agreement does not
take into account the agreement that could be expected by chance
alone, and hence does not seem entirely appropriate. The coefficient
$\kappa$ introduced by Cohen (1960) takes into account this chance
agreement and thus appears to be somewhat more appropriate (Swaminathan
et al., 1974). The coefficient $\kappa$, an expression for reliability of
criterion-referenced tests, is defined as

$$\kappa = (p_o - p_c) / (1 - p_c), \qquad (37)$$

where $p_o$, the observed proportion of agreement, is given by

$$p_o = \sum_{i=1}^{k} p_{ii}, \qquad (38)$$

and $p_c$, the expected proportion of agreement, is given by

$$p_c = \sum_{i=1}^{k} p_{i.}\, p_{.i}. \qquad (39)$$

It should be noted that $p_{i.}$ and $p_{.i}$ represent the proportions of
examinees assigned to the mastery state i on the first and second test
administration, respectively.

Since $p_o$ is the observed proportion of agreement and $p_c$ is the
expected proportion of agreement, $\kappa$ defined in Equation (37) can be
thought of as the proportion of agreement that exists, over and above that
which can be expected by chance alone. It should be stressed that $\kappa$
is based on the observed and expected proportions along the main diagonal
of the joint proportion matrix. It is unaffected by discrepancies that
exist in off-diagonal entries (for a further discussion, see Light, 1973).

The properties of $\kappa$ have been discussed in detail by Cohen (1960,
1968) and Fleiss, Cohen, and Everitt (1969). It suffices to note here
that the upper limit of $\kappa$ is +1 and may only occur when the marginal
proportions for different administrations are equal. However, if any
examinee is classified differently on repeated administrations, the
value of $\kappa$ will be less than +1.
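
Equations (37) through (39) translate directly into a short computation. The following minimal sketch takes the k by k matrix of joint proportions $p_{ij}$ and returns $\kappa$; the function name is hypothetical and no claim is made that this matches any program used by the cited authors.

```python
def cohen_kappa(p):
    """Cohen's kappa from a k x k matrix of joint proportions p[i][j],
    where i indexes the state on the first administration and j the
    state on the second."""
    k = len(p)
    p_o = sum(p[i][i] for i in range(k))                     # Equation (38)
    row = [sum(p[i]) for i in range(k)]                      # p_i.
    col = [sum(p[i][j] for i in range(k)) for j in range(k)] # p_.i
    p_c = sum(row[i] * col[i] for i in range(k))             # Equation (39)
    return (p_o - p_c) / (1 - p_c)                           # Equation (37)
```

For example, with two mastery states and joint proportions of .4, .1, .1, and .4, the observed agreement is .8, the chance agreement is .5, and $\kappa$ is .6.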

In the derivation of the $\kappa$ statistic, all inconsistent
classifications are weighted equally. The quantity $\kappa_w$, or weighted
kappa, which was introduced by Cohen (1968), represents an extension which
permits differential weighting of different kinds of misclassification.
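
Weighted kappa can be sketched in the same way. The version below assumes the disagreement-weight form, $\kappa_w = 1 - \sum w_{ij} p_{ij} / \sum w_{ij} p_{i.} p_{.j}$, in which $w_{ij}$ is the seriousness attached to being placed in state i on one administration and state j on the other and the diagonal weights are zero. The function name and the weight matrix are hypothetical.

```python
def weighted_kappa(p, w):
    """Weighted kappa from joint proportions p[i][j] and disagreement
    weights w[i][j] (zero on the diagonal, larger for worse disagreements)."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    observed = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    chance = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - observed / chance
```

When every off-diagonal weight is 1, this expression reduces to the unweighted $\kappa$ of Equation (37).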

The work of Swaminathan et al. (1974) clearly is based on the
concept of reliability as decision-making consistency.
Criterion-referenced test users who adopt these authors' concept and
coefficient of reliability should keep firmly in mind that consistent
decisions are not necessarily accurate decisions. Also, these authors point
out that $\kappa$ is dependent on factors such as the method for assigning
examinees to mastery states, selection of the cutting score, test length,
and the heterogeneity of the group. Hence, they recommend that when
reporting $\kappa$, other information such as cutting scores and student
ability as measured by the test be reported along with the reliability index.

Harris (1974b) introduced an index of efficiency for a mastery test.
Harris argues that a necessary characteristic of a mastery test is that
it should sort students into two categories and that if it is a valid 
test, it should sort students into the correct two categories, as de- 
termined by some criterion data. As a consequence, he proposes that, 
lacking criterion data, it may be informative to examine how well a test 
sorts students into mastery categories, where the cutting score for 
classification is some number of items correct. The index of efficiency,
$\mu_c^2$, is equivalent to a squared point biserial coefficient between
total score and a dichotomous variable indicating criterion group. Harris
(1974b) points out that the largest $\mu_c^2$ over all possible
classifications of the examinees is an upper bound to the validity of the
mastery test when validity is measured by an analogous index.
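
Because the index is described as equivalent to a squared point biserial coefficient, it can be illustrated as the ratio of the between-group sum of squares to the total sum of squares of the test scores. The sketch below relies on that equivalence rather than on Harris's own formula, which is not reproduced here; the function name and the rule that scores at or above the advancement score count as masters are assumptions.

```python
def squared_point_biserial(scores, advancement_score):
    """Squared point-biserial correlation between total score and the
    mastery/non-mastery split induced by an advancement score, computed
    as between-group sum of squares over total sum of squares."""
    n = len(scores)
    grand_mean = sum(scores) / n
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    if ss_total == 0:
        raise ValueError("all scores are equal; the index is undefined")
    ss_between = 0.0
    for is_master in (True, False):
        group = [x for x in scores if (x >= advancement_score) == is_master]
        if group:
            group_mean = sum(group) / len(group)
            ss_between += len(group) * (group_mean - grand_mean) ** 2
    return ss_between / ss_total
```

With scores of 1, 3, 7, and 9 and an advancement score of 5, the two groups are cleanly separated, yet the ratio is .9 rather than 1 because of the variability within each group.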

Harris' discussion of the index of efficiency implies that it may
serve as a coefficient of decision-making accuracy since, in general, a
large $\mu_c^2$ indicates a high decision-making accuracy. However,
$\mu_c^2$ interpreted as a coefficient of decision-making accuracy may be
misleading in some situations. For instance, if all the examinees are, say,
masters, $\mu_c^2$ may turn out to be relatively small even though the
decisions may be substantially accurate. Thus we would underestimate the
utility of the test for making mastery decisions. A situation that
plausibly occurs in criterion-referenced testing is for the test scores to
have a bimodal distribution. Let us assume that two non-overlapping
distributions that accurately indicate mastery occur. If there is any
within-distribution variability, $\mu_c^2$ will be less than one, but we
will be making accurate decisions on the basis of the test. While it is
clear that $\mu_c^2$ will be relatively large in this situation, it still
underestimates the decision-making accuracy of the test. Finally, it may be
possible that in using $\mu_c^2$ to compare the decision-making accuracy of
two tests, in at least some cases, $\mu_c^2$ may be higher for the test
with which we would make less accurate decisions. These difficulties arise
because $\mu_c^2$ is based on a squared error loss function, whereas the
threshold loss function appears to be more appropriate when
criterion-referenced tests are used to make mastery decisions. Thus,
although the applicability of $\mu_c^2$ to a single test and its ease of
computation make it attractive, care in interpretation must be taken if an
investigator adopts $\mu_c^2$ as a measure of decision-making accuracy.

Another interesting suggestion for reliability estimation comes
from the work of Livingston (1972a, 1972b, 1972c). He proposed a
reliability coefficient which is based on squared deviations of scores
from the cut-off score rather than the mean, as is done in the
derivation of reliability for norm-referenced tests in classical test theory.
The result is a reliability coefficient which has several of the
important properties of a classical estimate of reliability. In fact,
it can be easily shown that the classical reliability is simply a
special case of the new reliability coefficient. However, several
psychometricians (e.g., Harris, 1972; Shavelson, Block, & Ravitch, 1972)
have expressed doubts concerning the usefulness of Livingston's
reliability estimate. For example, while Livingston's reliability
estimate may be higher than a classical reliability estimate for a
criterion-referenced test, the standard error of the test is the same,
regardless of the approach to reliability estimation. Hambleton and
Novick (1973) note that they feel Livingston misses the point for much
of criterion-referenced testing. They suggest that it is not "to
know how far (a student's) score deviates from a fixed standard."
Certainly, Livingston's definition of the purpose of criterion-referenced
testing is different from the two primary uses reviewed in this
monograph. In fact, we are aware of no objectives-based programs that use
criterion-referenced tests in a way suggested by Livingston.
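
For readers who want the form of Livingston's coefficient, it is usually written as below. The notation is an assumption introduced here for illustration: $\rho_{XX'}$ is the classical reliability, $\sigma_X^2$ and $\mu_X$ are the observed score variance and mean, and $C$ is the cut-off score.

```latex
k^2(X, T) \;=\; \frac{\rho_{XX'}\,\sigma_X^2 + (\mu_X - C)^2}{\sigma_X^2 + (\mu_X - C)^2},
\qquad
k^2(X, T) = \rho_{XX'} \quad \text{when } C = \mu_X .
```

The second expression shows the sense in which classical reliability is a special case, and the first shows why the coefficient rises as the group mean moves away from the cut-off score even though the standard error of measurement is unchanged.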
Determination of Test Length

As in classical test theory, test length for a criterion-referenced
test is set to achieve some desired level of "accuracy" with
the test scores. In the case where estimation of domain scores is
of concern, the relationships among domain scores, errors of
measurement, and test length as summarized in the item-sampling model
are well known (Lord and Novick, 1968) and provide a basis for
determining test length.

When using criterion-referenced tests to assign examinees to mastery
states, the problem of determining test length is related to the size of
misclassification errors one is willing to tolerate. One way to assure
low probabilities of misclassification is to make the tests very long;
however, since there are a relatively large number of tests administered
in objectives-based programs, very long tests are not feasible.

Of course, an additional constraint imposed on the determination
of test length is the relatively large number of tests that are needed
within an objectives-based program, and so it would seem useful to
study the problem of setting test lengths within a total testing
program framework (see, for example, Hambleton, 1974).

There have been three approaches to the problem of determining
test length reported in the literature. One issue that distinguishes
the approaches is the concept of probability that underlies each
approach. The Bayesian approach of Novick and Lewis (1974) employs
the subjective meaning of probability, while the approaches of Millman
(1972, 1973) and of Fahner (1974) employ the frequency view of
probability.

Millman (1972, 1973) considered the error properties of mastery
decisions made by comparing an observed proportion correct score with
a mastery cut-off score. By introducing the binomial test model, one
can determine the probability of misclassification, conditional upon
an examinee's true score, an advancement score, and the number of items
in the test. (Advancement score is distinguished from cut-off score
in the following way: The advancement score is the minimum number
of items that an examinee needs to answer correctly to be assigned to
a mastery state. The cut-off score is the point on the true mastery
or domain score scale used to sort examinees into mastery and
non-mastery states.) By varying test length and the advancement score,
an investigator can determine the test length and advancement score
that produce a desired probability of misclassification for a given
domain score. The primary problem in applying the tables prepared
by Millman (1972) is that one would need to have a good prior estimate
of the domain score. Other problems have been suggested by Novick
and Lewis (1974): They report that for certain combinations of cut-off
scores and test length, changing one or both to decrease the
probability of misclassification for those above the cut-off score will
actually increase the probability of misclassification for those
below the cut-off score. In order to choose the appropriate
combination of test length and advancement score, one must have some
idea of whether the preponderance of students are above or below the
cut-off score and of the relative costs of misclassification. However,
the first requirement can only be satisfied with prior information
on the ability level of the group of examinees. Novick and
Lewis (1974) suggest that it would be useful to have some systematic
way of incorporating prior knowledge into the test length
determination problem.
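
The binomial calculation that underlies this kind of analysis can be sketched in a few lines. The sketch below is only an illustration under the binomial test model; it is not a reproduction of Millman's tables, and the function names and the example values (a true score of .90, a cut-off score of .80) are assumptions introduced here.

```python
from math import comb


def prob_advance(true_score, n_items, advancement_score):
    """P(number correct >= advancement_score) under the binomial model."""
    return sum(
        comb(n_items, k) * true_score**k * (1 - true_score) ** (n_items - k)
        for k in range(advancement_score, n_items + 1)
    )


def prob_misclassification(true_score, cutoff, n_items, advancement_score):
    """Probability that the mastery decision disagrees with the true state.

    A true master (true_score >= cutoff) is misclassified when he is not
    advanced; a true non-master is misclassified when he is advanced.
    """
    p_advance = prob_advance(true_score, n_items, advancement_score)
    if true_score >= cutoff:
        return 1.0 - p_advance
    return p_advance


if __name__ == "__main__":
    # Hypothetical examinee with true score .90 and a cut-off score of .80.
    for n_items, advancement_score in [(5, 4), (10, 8), (20, 16)]:
        risk = prob_misclassification(0.90, 0.80, n_items, advancement_score)
        print(n_items, advancement_score, round(risk, 4))
```

Note that the calculation requires a value for the true score, which is exactly the prior estimate of the domain score that the text identifies as the practical difficulty.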

Novick and Lewis (1974) provide such a method based on the Bayesian
Beta-binomial model. Their approach may be described as follows: For a
fixed prior, fixed cut-off score, and fixed loss ratio, identify those
combinations of test length and advancement score that "just favor" the
decision to classify the examinee as a master. By "just favor" we mean
that the difference in expected losses for a mastery classification and
a non-mastery classification lies in the interval [0, -r], where r is set
by the instructional designer. Then using the two criteria below choose
the optimal combination of test length and advancement score:

(1) Disregard test lengths that are absurd in the context
    that the testing takes place (in all cases test lengths
    less than 23 items are recommended),

(2) Choose a combination of test length and advancement score
    that will be reasonable for a class of appropriate prior
    distributions.

Clearly the results of such a procedure are dependent upon the chosen
prior distribution. In fact, because of criterion (2) above, the results
for any one prior distribution are dependent on the class of appropriate
priors. Novick and Lewis (1974) provide these guidelines for choosing
priors:

(1) choose a prior such that $E(\pi) = \pi_o$,

(2) choose priors such that $P(\pi \geq \pi_o)$ is just greater than .50,

(3) choose a class of priors with properties 1 and 2 but which
    differ in their variance.

The results also depend on the loss ratio, and the general result is that
longer tests and higher advancement scores are required with greater
loss ratios. Also, the results depend on the cut-off score but a
general trend does not really emerge.
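
To make the role of the prior and the loss ratio concrete, the following is a minimal sketch of a Beta-binomial mastery decision. It is not the Novick-Lewis tabling procedure itself: it assumes a simple threshold loss, restricts the Beta prior to integer parameters so that the posterior probability can be computed exactly from a binomial sum, and does not search for combinations that "just favor" mastery. All names and example values are assumptions.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """P(pi <= x) for pi ~ Beta(a, b) with positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


def prob_master(correct, n_items, cutoff, prior_a, prior_b):
    """Posterior P(pi >= cutoff) after `correct` right answers on `n_items`."""
    post_a = prior_a + correct
    post_b = prior_b + n_items - correct
    return 1.0 - beta_cdf_int(cutoff, post_a, post_b)


def advancement_score(n_items, cutoff, prior_a, prior_b, loss_ratio):
    """Smallest number correct for which advancing has the lower expected loss.

    loss_ratio = loss of a false advancement / loss of a false retention
    (threshold loss, an assumption of this sketch). Returns None if no
    score on the test favors advancement.
    """
    for correct in range(n_items + 1):
        p = prob_master(correct, n_items, cutoff, prior_a, prior_b)
        if loss_ratio * (1.0 - p) <= p:
            return correct
    return None


if __name__ == "__main__":
    # Hypothetical prior Beta(8, 2) (mean .80) and cut-off score .80.
    for ratio in (1, 2, 3):
        print(ratio, advancement_score(10, 0.80, 8, 2, ratio))
```

Because the posterior probability of mastery increases with the number of items answered correctly, a larger loss ratio can only raise the advancement score for a fixed test length, which agrees with the general result stated above.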

Novick and Lewis (1974) mention the important trade off between
instructional time and testing time. If instructional time is increased,
the expected value of the prior distribution should increase. A prior
with a greater expected value permits shorter tests, or if the tests
remain the same length this prior will, in general, reduce the risk of
misclassification. However, the saving from either of the latter, or some
combination thereof, has to be balanced against the cost of additional
instruction.

Novick and Lewis make three summary remarks: 

(1) In most situations, a level of functioning of something less
    than .85 is satisfactory. A value as low as .73 would be
    highly desirable. This could be accomplished by redefining
    the task domain slightly so as to eliminate very easy items.

(2) [Instruction] should be carefully monitored so that expected
    group performance will be just slightly higher than the
    specified criterion level. This will keep [instruction] time
    and testing time relatively short.

(3) The program should be structured so that very high loss
    ratios are not appropriate. That is, individual modules
    should not be overly dependent on preceding ones.

As Novick and Lewis suggest, it remains to be determined whether
these three concerns can be adequately handled within the context of
objectives-based programs. To the extent that they can, the
Novick-Lewis results should be quite useful. Although it may be obvious,
it is perhaps worthwhile to mention also that, strictly speaking, the
test length recommendations in Novick and Lewis (1974) are applicable
only if the Beta-binomial model is to be used in decision making. We
just don't know how optimal the recommendations derived from the model
are for the other Bayesian models reported in the literature (Novick
et al., 1973; Lewis et al., 1973, 1974).

Fahner (1974) has proposed a procedure that is similar to that
proposed by Millman but which avoids the formal difficulty of
estimating the value of an examinee's domain score prior to obtaining
any data. Fahner's approach is a modification of the procedure
employed in significance-testing. The basic procedure is to determine
a critical score $c$ and the test-length $n_o$ such that

$$\mathrm{Prob}[Y_{ga} \geq c \mid \pi] \leq \alpha \quad \text{for all } \pi < \pi_o$$

and

$$\mathrm{Prob}[Y_{ga} < c \mid \pi] \leq \beta \quad \text{for all } \pi \geq \pi_o,$$

where $\alpha$ and $\beta$ are the largest acceptable risk levels and
$Y_{ga}$ is the observed domain score of examinee $a$ on test $g$. Since
it is not possible to keep both $\alpha$ and $\beta$ at acceptable levels
when the number of items in the test is less than that in the domain,
Fahner suggests specifying two values, $\pi_1$ and $\pi_2$, such that the
errors in deciding $\pi \geq \pi_o$ when in fact $\pi_1 < \pi < \pi_o$,
and $\pi < \pi_o$ when in fact $\pi_o < \pi < \pi_2$, are not very
serious. The interval $[\pi_1, \pi_2]$ is thus an indifference region.
Once $\pi_1$ and $\pi_2$ are specified, the normal approximation to
the binomial distribution can be used to determine $c$ and $n_o$, the
length of the test.

A difficulty which is shared by the Millman, Novick-Lewis, and
the Fahner approaches is the choice to work with the binomial model.
We use performance on a random sample of items to generalize to
performance on a domain of items. In studying the adequacy of the
generalization we may concern ourselves with the results that might
have occurred using different random samples of items. In this
context the binomial error model is justified. However, if we concern
ourselves with the results that might have occurred on a different
administration of the same test, the compound binomial model is more
appropriate. Which kind of alternative results should we consider?
We feel there is merit in studying the results that might have occurred
on different administrations of the same test, since this is the only
test on which decisions are actually made. There are two important
implications of the choice of a model for measurement error. First,
the errors of measurement derived from the compound binomial model
are somewhat smaller than with the binomial model so that the
recommendations based on the Beta-binomial may be quite conservative.
(This is especially true when one recalls that Novick and Lewis
(1974), in the interest of making uniform test length recommendations
over a class of priors, have already provided conservative
recommendations.) Second, the possible bias of the observed score as an
estimate of the domain score and the effect of that bias on the
likelihood function for the observed score has been ignored.

An important problem related to test length, but which has not been
examined in the literature on criterion-referenced testing, is the problem
of allocating the total time available for testing to the various tests
that are to be administered in the instructional program.

Determination of Cut-off Scores 

The problem of determining cut-off scores is an extremely important
problem for criterion-referenced testing, although it has received only
limited attention from researchers. Perhaps the most important
ramification of the choice of cut-off scores is the psychological effect
it has on students. In addition, changes in the cut-off score affect the
"reliability" and the "validity" of the test scores.

Millman (1973) considers five factors in the setting of cut-off
scores: Performance of others, item content, educational consequences,
psychological and financial costs, errors due to guessing and item
sampling.

With respect to "performance of others," Millman (1973) discusses
two possible procedures. The first is to set the cut-off score so that
a predetermined percentage of the students "pass." However, this
procedure is inconsistent with the philosophy of objectives-based programs
and therefore it would not seem to be applicable. A second procedure is
to identify a group of students who have already "mastered" the
material. This group is administered the test and the cut-off score is
chosen as the raw score corresponding to a chosen percentile score. Again,
the applicability of this procedure to most objectives-based programs
seems dubious, but there may be some situations in which the procedure
is reasonable.

The second factor is "item content." This approach requires the
instructional designer to inspect the items and to determine the subjective
probability that some sub-population of the students would get some
sub-population of the items correct. (This includes the possibility of
deciding that all students would get a particular item correct.) Passing
scores are then determined by either a conjunctive or compensatory model.
In the conjunctive model, multiple cut-off scores are determined as
expected scores within each item group, while for the compensatory model a
single cut-off score is determined as the expected value over all items.
This approach does have some relevancy in objectives-based programs.
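
The difference between the two models can be illustrated with a small sketch. The item groups and the subjective probabilities below are invented for illustration only; the expected score for a set of items is taken to be the sum of the judged probabilities of a correct response.

```python
def conjunctive_cutoffs(item_groups):
    """One cut-off score per item group: the expected score within the group."""
    return {name: sum(probs) for name, probs in item_groups.items()}


def compensatory_cutoff(item_groups):
    """A single cut-off score: the expected score over all items."""
    return sum(sum(probs) for probs in item_groups.values())


if __name__ == "__main__":
    # Hypothetical judged probabilities that the target students answer
    # each item correctly, arranged by item group.
    judged = {
        "computation": [0.9, 0.9, 0.8, 1.0],
        "word problems": [0.7, 0.6, 0.8],
    }
    print(conjunctive_cutoffs(judged))
    print(compensatory_cutoff(judged))
```

Under the conjunctive model an examinee must reach the cut-off in every group, whereas under the compensatory model a strong performance in one group can offset a weak performance in another.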

The schemes involved under the heading "educational consequences"
involve determining the cut-off score that maximizes independent
learning criteria. Millman suggests, amongst other things, the guideline
that higher cut-off scores are required for fundamental or prerequisite
skills. He also argues that skills that are not prerequisite should not
have cut-off scores.

Consideration of psychological and financial costs leads to the
suggestion that a low cut-off score be set when remediation costs are high.
In situations with lower remediation costs or higher costs for false
advancements, higher cut-off scores can be considered. The Bayesian
approach considers a fixed threshold score and varies the advancement
score to contend with loss ratios, while Millman's approach leads to
changing the threshold score itself.

The last factor considered by Millman concerns error due to guessing
and item sampling. He tentatively suggests a correction for guessing to
contend with the guessing source of error. The error introduced by item
sampling is a bias due to systematically disregarding some of the types
of questions and content in the domain. Reasons for leaving such items
out of the test may be difficulty of construction, inconvenience of
administration, or simply ignorance of the extent of the domain. Millman
reasonably suggests adjusting the cut-off score for the bias, although
he does not treat the question of determining the bias. He also does
not explicitly consider the possibility of getting a poor sample of
items by random sampling.

An empirical approach to the problem of studying the effects of
cut-off scores was taken by Block (1972). His study was motivated in
part by Bormuth's (1971) contention that rational techniques of
determining cut-off scores, that can be defended logically and
empirically, must be developed and in part by Cahen's (1970) suggestion
that one way the assessment of learning outcomes for an instructional
segment can be accomplished is by examining how well the segment has
prepared students for future learning.

The learning materials in the experiment were three units of
programmed text material on matrix algebra topics appropriate for eighth
grade students. Five experimental groups differed with regard to the
mastery cut-off score set for the groups. The cut-off scores were .65,
.75, .85, and .95. In a particular experimental group all students were
required to surpass the cut-off score. This was accomplished by
self-directed review sessions. An additional control group did not have a
cut-off score established and was not permitted to review.

Block (1972) studied the degree to which varying cut-off scores
during segments of instruction influence end of learning criteria. Six
criterion variables were selected for study: Achievement, time needed
to learn, transfer, retention, interest, and attitude. The results are
rather interesting but somewhat limited in generalizability. The results
revealed that groups subjected to higher cut-off scores during
instruction performed better on the achievement, retention, and transfer
tests. On the interest and attitude measures, there was a trend for
interests and attitudes to increase until the .85 group and then to level
off (it should be noted that the .75 group fared very poorly on the
transfer, interest and attitude measures, suggesting some
extra-experimental influence). Therefore, the results suggest that
different cut-off scores may be necessary to achieve different outcome
measures.

Tailored Testing Research

The considerable amount of testing required to successfully 
implement objectives-based programs has been criticized, but to some 
extent this amount of testing can be justified on the grounds that
testing is an integral part of the instructional process. Nevertheless, 
research is needed on procedures that offer the potential for reducing 
time but which do not result in any appreciable loss in the quality of 
decision-making from test results. Earlier in the monograph we 
discussed the use of Bayesian statistical methods as a basis for 
improving estimation and decision-making. When it is possible to 
arrange the objectives of an objectives-based instructional program 
into learning hierarchies (White, 1973, 1974), another promising
procedure is that of tailored testing (Ferguson, 1969; Lord, 1970;
Nitko, 1974). 

Tailored testing has been defined as a strategy for testing in 
which the sequence and number of test items a student receives are
dependent on his performance on earlier items. In testing objectives 
organized into a learning hierarchy, one can make inferences about 
student mastery of objectives in the hierarchy which have not been 
tested. If, for example, a student is tested and found to have pro- 
ficiency in a specified objective, all objectives prerequisite to it 
can also be considered mastered. If the examinee lacks proficiency in 
an objective it can be inferred that all objectives to which it is a 
prerequisite are also unmastered. 
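
The two inference rules just described can be sketched as follows. The hierarchy, the objective labels, and the function names are hypothetical; the sketch simply propagates a tested result upward or downward through the prerequisite relations and never overrides a result that was obtained by direct testing.

```python
def all_prerequisites(objective, prereqs):
    """All direct and indirect prerequisites of an objective."""
    seen = set()
    stack = list(prereqs.get(objective, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(prereqs.get(current, []))
    return seen


def infer_mastery(tested, prereqs):
    """Extend tested results to untested objectives in the hierarchy.

    tested:  dict mapping objective -> True (mastered) / False (not mastered)
    prereqs: dict mapping objective -> list of its direct prerequisites
    """
    objectives = set(prereqs) | {p for ps in prereqs.values() for p in ps}
    status = dict(tested)
    for objective, mastered in tested.items():
        if mastered:
            # Mastery implies mastery of everything prerequisite to it.
            for p in all_prerequisites(objective, prereqs):
                status.setdefault(p, True)
        else:
            # Non-mastery implies non-mastery of everything that depends on it.
            for other in objectives:
                if objective in all_prerequisites(other, prereqs):
                    status.setdefault(other, False)
    return status


if __name__ == "__main__":
    # Hypothetical hierarchy: A is prerequisite to B and C; B and C to D.
    hierarchy = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    print(infer_mastery({"B": True}, hierarchy))
    print(infer_mastery({"B": False}, hierarchy))
```

Each inferred classification is a decision made without administering any items on that objective, which is the source of the savings in testing time.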

Work on tailored testing has only recently attracted the
attention of educational researchers. While there were several studies in
the 1950's and early 1960's, Frederic Lord's recent work in improving
the precision of measuring an examinee's ability while decreasing the
amount of testing time (Lord, 1970, 1971 a, b, c) has done much to bring
attention to tailored testing. Recently, Wood (1973) provided a
comprehensive review of this line of research.

Ferguson's work in 1969 typifies a second line of research on
tailored testing. It is an adaptation of tailored testing to situations 
in which the testing problem is one of classifying individuals into 
mastery states rather than precisely estimating their ability. It is 
this second line of research that has direct application to testing 
problems in objectives-based programs. Ferguson (1969, 1971) was con- 
cerned with classifying students with respect to mastery or non-mastery 
at each level of proficiency on the learning hierarchy. To accomplish 
this, computer-based tailored testing was applied to a hierarchy of 
skills in an objectives-based curriculum. The routing strategy that 
Ferguson used was complex and required a computer to perform the actual 
routing. What he found was a 60% savings in time in the computerized 
administration using a variety of branched test models. A study of the 
consistency of classifying students with respect to mastery or
non-mastery of specific objectives revealed that consistency of mastery
decisions was higher when the decisions were made using tailored testing
strategies than with a conventional testing procedure. The validity
of the tailored testing approach was also found to be high.

In a recent study, Spineti and Hambleton (in press) investigated
the interactive effects of several factors on the quality of decision- 
making and on the amount of testing time in a tailored testing situa- 
tion. To enable the study of a large number of tailored testing strategies 
in different testing situations, computer simulation techniques were em- 
ployed. Factors selected for study because they were considered to be im- 
portant in the overall effectiveness of a tailored testing strategy inclu- 
ded test length, cutting score, and starting point. (Test length is de- 
fined as the number of items administered to a student to assess mastery 
of an objective; cutting score is defined as the point on the mastery 
score scale used to separate students into mastery and non-mastery 
states; and starting point is the place in the learning hierarchy where 
testing is initiated.) Various values of each factor were combined to 
generate a multitude of tailored testing strategies for study with two 
learning hierarchies and three different distributions of true mastery 
scores across the hierarchies. (Of the many learning hierarchies that 
are available in the educational literature, the learning structures for 
hydrolysis of salts (Gagne, 1965) and addition-subtraction (Ferguson, 
1969) were selected. The two learning hierarchies are shown in Figures 
1 and 2.) The criteria chosen to evaluate the effectiveness of each 
tailored testing strategy were the accuracy of classification decisions 
relating to mastery, and the amount of testing time. 

The simulation results indicated that it is possible to obtain a 
reduction of more than 50% in testing time without any loss in decision- 
making accuracy, when compared to a conventional testing procedure, by 
implementing a tailored testing strategy. In addition, the study of
starting points revealed that it was generally best to begin testing 
in the middle of a learning hierarchy regardless of the ability dis- 
tribution of examinees across the learning hierarchy. In summary, it 
was dramatically clear from the numerous simulations, that there 
was considerable saving in testing time gained through implementing 
a tailored testing strategy. And, whereas the Ferguson tailored 
testing strategies could only be implemented with the aid of com- 
puter testing terminals, the Spineti-Hambleton tailored testing 
strategies were simple enough that they could be implemented in the 
regular classroom with the aid of a "programmed instruction type"
booklet.

Among the problems that remain to be resolved in the area of 
tailored testing research, two seem particularly important. The first 
involves an extension of the Ferguson and Spineti-Hambleton work. Of 
most importance we see a need for further study of routing methods and 
stopping rules. The Spineti-Hambleton study made use of only the
simplest routing methods and stopping rules; there is therefore
substantial room (and need) for extensions. In addition, it would likely
be useful to consider test models in the simulation of test data that 
incorporate a guessing factor since it is well-known that guessing plays 
a part in individual test performance. 

A second line of research would involve some empirical research on
tailored testing in the schools. The design of such a study would
involve developing a programmed instruction booklet which would include
test items designed to measure specific objectives in a learning hierarchy,
a self-scoring device, and routing directions. Among the factors that
could be investigated in an empirical study are test length, mastery
cut-off score, and routing method. In addition, it would be
interesting to study the merits, in terms of overall testing efficiency,
of having individuals generate their own starting points for testing
in the learning hierarchy.

Description of a Typical Objectives-Based Program

Introduction 

As mentioned earlier in the monograph, the trend toward
individualization of instruction in elementary and secondary education has
resulted in the development of a diverse collection of attractive
alternative models (Gibbons, 1970; Gronlund, 1974; Heathers, 1972), many of
which are objectives-based. According to their supporters, these models
offer new approaches to student learning that can provide almost all
students with rewarding school experiences. All of these models, as well as
many others,
represent significant steps forward in improving learning by individu- 
alizing instruction. They strive to involve the student actively in 
the learning process; they allow students in the same class to be at 
different points in the curriculum; and they permit the teacher to 
give more individual attention. 

To give the reader a flavor for the scope of criterion-referenced 
testing within an objectives-based program we have included a detailed 
review of the testing and decision-making procedures within the Indi- 
vidually Prescribed Instruction Program (Glaser, 1968). 

The Learning Research and Development Center (LRDC) at the University
of Pittsburgh initiated the Individually Prescribed Instruction Project
during the early 1960's at the Oakleaf School, in cooperation with the
Baldwin-Whitehall Public School District near Pittsburgh. Major
contributors to the project over the years have included Robert
Glaser, John Bolvin, C. Lindvall, and Richard Cox. As of 1974, the
IPI program has been adopted by over 250 schools around the country.

Instructional Paradigm

It is instructive, first of all, to describe the structure of the 
mathematics curriculum. Cooley and Glaser (1969) report that the
mathematics curriculum consists of 430 specified instructional objectives.
These objectives are grouped into 83 units. (In the 1972 version of
the program, there were 359 objectives organized into 71 units.) Each
unit is an instructional entity, which the student works through at any 
one time. There are 5 objectives per unit, on the average, the range 
being 1 to 14. A collection of units covering different subject areas 
in mathematics comprises a level; the levels may be thought of as roughly 
comparable to school grades. For illustrative purposes, we have presented 
in Table 5 the number of objectives for each unit in the IPI mathematics 
curriculum. 

The teacher is faced with the problem of locating for each student 
that point in the curriculum where he can most profitably begin instruc- 
tion. Also, the teacher is responsible for the continuous diagnosis of 
student mastery as the student proceeds through his program of study. 

At the beginning of each school year, the teacher places the
student within the curriculum; that is, the teacher identifies the units in
each content area for which instruction is required. After completing
the gross placement, a single unit is selected as the starting point for
instruction, and a diagnostic instrument is administered to assess the
student's competencies on objectives within the unit. The outcome of
the unit test is information appropriate for prescribing instruction on
each objective in the unit. In addition, it is also necessary to select
the particular set of resources for the student. In theory, resources
that match the individual's "learning style" are selected. Within each
unit, there are short tests to monitor the student's progress. Finally,
upon completion of initial instruction in each unit, assessment and
diagnostic testing takes place. In the next section, the tests and the
mechanisms for making these decisions are reviewed.

                                 TABLE 5

  Number of Objectives for Each Unit in the IPI Mathematics Curriculum [a]

                                            Level
Content Area                  A    B    C    D    E    F    G    H

Numeration                   12   10    8    8    8    3    8    4
Place Value                   -    3    5   10    7    5    2    1
Addition                      -   10    5    8    6    2    3    2
Subtraction                   -    -    4    6    3    1    3    1
Multiplication                -    -    -    8   11   10    6    3
Division                      -    -    -    7    7  [?]    5    5
Combination of Processes      -    -    6    5    7    1  [?]  [?]
Fractions                     3    2    4    6    6   14    5    2
Money                         -    4    4    6    4    1    -    -
Time                          -    3    2    7    9    5    3    1
Systems of Measurement        -    4    3    5    7    3    2    -
Geometry                      -    2    2    3    9   10    7    9
Special Topics                -    -    1    3    3  [?]    4    5

[a] Reproduced by permission from Lindvall, Cox, and Bolvin (1970).

Testing Model Description 

Various research reports over the last couple of years have dealt 
with the testing model and its development (Cox & Boston, 1967; Glaser
& Nitko, 1971; Lindvall et al., 1970). A flow chart of the testing
model is presented in Figure 3. To monitor a student through the
program the following criterion-referenced tests are used: Placement
tests, unit pretests, unit posttests, and curriculum-embedded tests.
All of the tests are criterion-referenced, with student performance
on the tests compared to performance standards for the purpose of
decision-making.

Let us now consider in detail the four kinds of tests and the 
method for student diagnosis. 

Placement Tests. When a new student enters the program, it is
necessary to place the student at the appropriate level of instruction
in each of the content areas. (Glaser and Nitko (1971) called this
stage-one placement testing.) Typically, this is done by administering
a placement test that covers all of the subject areas at a particular
level (see Table 5). Factors affecting the selection of a level for
placement testing of a student include student age, past performance,
and teacher judgment. Generally, the placement test covers the most
difficult or most characteristic objectives within each area. Placement
tests are administered until a unit profile identifying a student's
competencies within each area is complete. At present, the somewhat
arbitrary 80-85% proficiency level is used for most tests in the IPI
system.

[Figure 3. Flowchart of steps in monitoring student progress in the IPI:
placement test taken; one specific unit selected for study; unit pretest
taken (pass all skills, or fail one or more skills); prescription
developed for one skill in unit; student works on instructional
materials for one skill; CET for skill taken (pass CET, or fail CET);
pass CET for last unmastered skill; unit posttest taken (pass all
skills, or fail one or more skills). Reproduced, by permission, from
Lindvall and Cox, 1969.]

Student test scores on items measuring objectives in each unit 
and area in the placement test are used to develop a program of study. 
The standard procedure is to assign a student to instruction on units 
in which placement test performance on items measuring a few representa- 
tive objectives in the units is between 20% and 80%. If the score is 
less than 20% for a given unit, the unit test in the area at the next 
lowest level is administered and the same criterion is applied. In 
the case where a student has a score of 80% or over, testing the unit 
in the area at the next highest level is initiated. (Further informa- 
tion is provided by Lindvall and Cox, 1970; Weisgerber, 1971; and 
Cox and Boston, 1967.) 
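
The placement rule described above amounts to a three-way decision on the percentage of placement items answered correctly for a unit. A minimal sketch follows; the function name and the wording of the returned decisions are assumptions, while the 20% and 80% bounds are those given in the text.

```python
def placement_decision(percent_correct, lower=20.0, upper=80.0):
    """Decision for one unit based on the placement test percentage.

    Below `lower`: test the unit in the same area at the next lowest level.
    At or above `upper`: test the unit in the same area at the next highest
    level. Otherwise: assign the student to instruction on this unit.
    """
    if percent_correct < lower:
        return "test next lowest level"
    if percent_correct >= upper:
        return "test next highest level"
    return "assign instruction on unit"


if __name__ == "__main__":
    for score in (10, 20, 55, 80, 95):
        print(score, placement_decision(score))
```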

In summary, we note that the placement test has the following
characteristics: It provides a gross level of achievement for any 
student in the curriculum, and it provides information for proper place- 
ment of students in the curriculum. 

Unit Pretests and Posttests. Having received an initial
prescription of units, a student proceeds next to take a pretest for a
unit at the lowest level of mastery in his profile. (Glaser and Nitko
(1971) call this stage-two placement testing.)

A student is prescribed instruction in each objective in the unit
for which he fails to achieve an 85% mastery level on the pretest. A
mastery score on each objective for a student is calculated as the
percentage of items on the test measuring the objective that the student
answers correctly. In the case where the student demonstrates mastery 
of each objective, he is moved on to the next unit in his profile, 
where he again takes a pretest. 
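
The pretest scoring rule can be written out directly. In the sketch below the item responses and objective labels are invented; the mastery score is the percentage of items measuring an objective that are answered correctly, and instruction is prescribed wherever that score falls below the 85% level.

```python
def mastery_scores(responses):
    """Percentage correct on the items measuring each objective.

    responses: dict mapping objective -> list of item scores (1 right, 0 wrong)
    """
    return {
        objective: 100.0 * sum(items) / len(items)
        for objective, items in responses.items()
    }


def prescribe(responses, mastery_level=85.0):
    """Objectives on which instruction is prescribed (score below the level)."""
    scores = mastery_scores(responses)
    return [obj for obj, score in scores.items() if score < mastery_level]


if __name__ == "__main__":
    # Hypothetical unit pretest with three objectives.
    pretest = {
        "objective 1": [1, 1, 1, 1, 1],
        "objective 2": [1, 1, 1, 0, 0],
        "objective 3": [1, 1, 1, 1, 0],
    }
    print(mastery_scores(pretest))
    print(prescribe(pretest))
```

An empty prescription corresponds to the case in which the student is moved on to the next unit in his profile.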

The unit posttests are simply alternate forms of the unit pretests
and are administered to students as they complete instruction on the
unit. A student receives a mastery score for each objective in the
unit. He is required to repeat instruction on any objective where
he fails to achieve an 85% mastery score. The student is directed to
the next unit in his profile if he demonstrates mastery on each
objective covered in the unit posttest. The next unit prescribed is almost
always one at the lowest level of mastery (or grade level). Those who
repeat instruction on one or more of the objectives must take the unit
posttest again before moving on in their program.

Let us briefly consider the losses involved in making different 
decisions on the basis of unit testing data. It should be recalled 
that the unit tests are used to measure student performance on 
each objective or skill included in the unit with several test items. 
A student who is mistakenly assigned to a mastery state on an 
objective covered in the pretest will not likely have the same error 
in assignment based on the posttest, and so, on the basis of his posttest 
performance, the student will be assigned instruction on the objective. 
However, to the extent that the objective is a prerequisite to other 
objectives in the student's program of study on the unit, he is going
to have some instructional problems. Perhaps this is one place where
Bayesian statistical procedures might be useful. They could be used
to produce an "improved" profile of test scores across the objectives
measured by the unit pretest. Essentially, test performance on an
objective that was not consistent with the performance on other
objectives in the unit could be modified somewhat. On the average,
better mastery-type decisions would result. Likewise, this strategy
could be used on the unit posttests.
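
One simple way to picture such an "improved" profile is to pull each objective score toward the student's overall performance on the unit. The sketch below is not the Bayesian procedure of any cited author; it assumes a beta-binomial model in which the prior for every objective is centered on the student's pooled proportion correct, and `prior_weight` (the number of pseudo-items given to that prior) is a hypothetical tuning value.

```python
def improved_profile(correct, n_items, prior_weight=5.0):
    """Shrink each objective's proportion correct toward the pooled proportion.

    correct[obj]  -- number of items answered correctly on the objective
    n_items[obj]  -- number of items measuring the objective
    prior_weight  -- assumed strength of the prior, in pseudo-items
    """
    pooled = sum(correct.values()) / sum(n_items.values())
    return {
        obj: (correct[obj] + prior_weight * pooled) / (n_items[obj] + prior_weight)
        for obj in correct
    }


# Hypothetical unit pretest: objective_3 is out of line with the other two.
correct = {"objective_1": 5, "objective_2": 5, "objective_3": 2}
n_items = {"objective_1": 5, "objective_2": 5, "objective_3": 5}

for obj, estimate in improved_profile(correct, n_items).items():
    raw = correct[obj] / n_items[obj]
    print(f"{obj}: raw {raw:.2f} -> adjusted {estimate:.2f}")
```

In this example the inconsistent score of 0.40 is moved up to 0.60 and the perfect scores are moved down to 0.90, which is the kind of modest modification described above. How much to shrink is exactly the issue that the choice of prior must settle.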

As far as assigning a student to instruction on objectives he 
has already mastered, it should be noted that this is likely to be 
frustrating to the student; however, the majority of false-negative 
errors occur because students are close to the cutting score.

False-positive errors on the posttest are important if the objectives 
on which errors are made are prerequisites to other objectives in future 
units. It should be added that false-positive errors seem to be less 
serious if they are made on objectives that are terminal objectives 
(i.e., an objective is terminal if it is not a prerequisite to any 
other objective in the program). As compared to false-positive errors,
false-negative errors are correspondingly less serious because the 
student can quickly move through the remedial materials and retake 
the posttest. 
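
The relative seriousness of the two kinds of error can be expressed as a threshold-loss decision rule. The notation here is assumed for illustration and is not taken from the program: \pi is the student's true domain score on an objective, \pi_0 is the cutting score, x is the observed test performance, and l_{FP} and l_{FN} are the losses attached to a false-positive and a false-negative error, respectively.

```latex
\text{advance the student if}\quad
l_{FP}\,\Pr(\pi < \pi_0 \mid x) \;<\; l_{FN}\,\Pr(\pi \ge \pi_0 \mid x)
\quad\Longleftrightarrow\quad
\Pr(\pi \ge \pi_0 \mid x) \;>\; \frac{l_{FP}}{l_{FP} + l_{FN}}
```

Under this rule, equal losses require only that mastery be more likely than not, whereas an objective that is a prerequisite to later work, with, say, l_{FP} = 3\,l_{FN}, requires a probability of mastery above 0.75 before the student is advanced. A terminal objective would carry a smaller l_{FP} and hence a lower threshold.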

In summary, pretests and posttests are available for each unit of
instruction. The proper pretest is administered on the basis of a
student's curriculum profile, and learning tasks for each objective 
(or skill, as it is called in the IPI program) within the unit are 
assigned (or not assigned) on the basis of a student's performance 
on items measuring the objective. 

Curriculum-Embedded Tests. As the student proceeds through a
unit of instruction, his progress is monitored. This is done by the
use of curriculum-embedded tests (CET). As used in the mathematics
IPI program, a CET is primarily a measure of performance on one
specific objective. There are usually several test items to measure
the objective. A review of the CETs in Level E of the program revealed
that there are, on the average, about three items measuring the primary
objective covered in the CET. The range is from two to five items.
If a student receives a score of at least 85%, he is permitted to move
on to the next prescribed objective. Otherwise, the student is sent back
for additional work before taking an alternate form of the CET.

A second purpose of the CET is to assess, albeit in a fairly
crude way, whether or not the student has mastered the next objective
in the specified sequence for studying the objectives covered in the
unit. If the second objective included in the CET is not one the
student has been assigned to study, he is moved on to be pretested
on the second half of a CET that covers the next objective in the
student's program of study. Regardless of which CET a student takes,
if he scores above 85% on the items tested, instruction on the objective
is not required. Essentially, this means that a student must score
100% since there are normally only about two items included in the
test to cover the second objective. This additional pretesting
of an objective in the CET gives students a chance to demonstrate
mastery of new skills not specifically covered in the instruction up
to that point and to eliminate that instruction from their programs.
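
The arithmetic behind the remark that a student "must score 100%" is easy to check. The sketch below assumes that mastery means answering at least 85% of the items correctly; the function name is hypothetical.

```python
def min_correct_for_mastery(n_items, cutoff_percent=85):
    """Smallest number of correct answers that reaches the cutoff.

    Computes ceil(cutoff_percent * n_items / 100) in integer arithmetic
    to avoid floating-point rounding.
    """
    return (cutoff_percent * n_items + 99) // 100


for n in range(2, 8):
    needed = min_correct_for_mastery(n)
    print(f"{n} items: {needed} correct needed ({n - needed} miss allowed)")
```

With an 85% cutoff, every test of two through six items requires a perfect score; only at seven items can a single item be missed. Since the CETs reviewed contain two to five items on the primary objective and about two on the second, a single error of any kind results in a nonmastery decision.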

Summary and Suggestions for Further Research

The successful implementation of objectives-based programs depends,
in part, upon the availability of appropriate procedures for developing 
and utilizing criterion-referenced tests for monitoring student pro- 
gress. The organization and discussion of the available literature 
on topics such as the uses of criterion-referenced tests, test deve- 
lopment, statistical issues in criterion-referenced measurement, validity, 
reliability, and tailored testing, provided in the monograph, should 
facilitate the continued development and improvement of criterion- 
referenced testing in the field. Remaining to be resolved, however, 
are many technical and practical issues. Let us consider the tech- 
nical issues first. 

First, we are quite enthusiastic about the contributions of
Bayesian methods for improving the estimation of domain scores and the
allocation of examinees to mastery states, and there is a growing
number of impressive results to support our enthusiasm (for example,
Novick and Jackson, 1974; Novick and Lewis, 1974). However, we still
have some concerns about the overall gains that might accrue in view 
of the complexity of the procedures, the robustness of the Bayesian 
models in testing situations where the underlying assumptions of the 
model are not met (for example, when one has very short tests), and 
the sensitivity of the Bayesian models to the specification of 
priors. We note that several of these concerns have been addressed,
in part, by Lewis, Wang, and Novick (1974) and we are aware of other
studies in progress that also address our concerns. 

A second problem, which has not been studied at all in the context
of criterion-referenced testing, is an instance of the
bandwidth-fidelity dilemma (Cronbach & Gleser, 1965). With a variety of
decisions of varying importance to be made in an individualized
instructional program and with a limited amount of testing time available,
how does one go about determining the "best" distribution of testing
time? Does one try to collect considerable test data to make the
few most important decisions, or does one try to distribute the avail- 
able testing time in such a way as to collect a little information 
relative to each decision? A solution to this important problem 
is required for an efficient testing program. Determination of test
lengths for each domain without regard for the size and scope of 
the total testing program could produce a serious imbalance between 
testing and instructional time. Hambleton and Swaminathan (in pro- 
gress) are studying the problem of distributing testing time across 
a wide variety of tests (where the tests vary in reliability, validity, 
and importance to the testing program). The main problem that arises 
is that it is difficult to obtain a suitable criterion to reflect 
the "effectiveness" of the testing program.
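
To make the trade-off concrete, the following sketch distributes a fixed number of test items across several decisions. It does not solve the criterion problem noted above; it simply assumes one, namely the importance-weighted sum of sampling variances, sum of w_i * v_i / n_i, where w_i is a hypothetical importance weight, v_i an assumed per-item variance, and n_i the number of items allotted. All names and numbers are illustrative.

```python
def allocate_items(weights, variances, budget, min_items=1):
    """Greedy allocation of `budget` items to minimize sum(w * v / n).

    Each remaining item goes to the test where it reduces the assumed
    criterion the most. Because each term w * v / n is convex in n,
    this greedy choice gives the minimizing integer allocation.
    """
    names = list(weights)
    alloc = {name: min_items for name in names}
    remaining = budget - min_items * len(names)
    if remaining < 0:
        raise ValueError("budget is too small for the minimum test lengths")

    def gain(name):
        n = alloc[name]
        return weights[name] * variances[name] * (1 / n - 1 / (n + 1))

    for _ in range(remaining):
        best = max(names, key=gain)
        alloc[best] += 1
    return alloc


# Hypothetical program: a placement decision judged four times as
# important as a unit posttest decision, twelve items in all.
weights = {"placement": 4.0, "unit_posttest": 1.0}
variances = {"placement": 0.25, "unit_posttest": 0.25}
print(allocate_items(weights, variances, budget=12))
```

Under this assumed criterion the more important decision receives eight of the twelve items, twice as many as the other, rather than four times as many: test lengths grow with the square root of importance. A different criterion would give a different answer, which is why the choice of criterion is the central difficulty.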

Third, within objectives-based instructional programs where the 
objectives can be arranged into learning hierarchies, the strategy 
of branched testing would seem to offer considerable potential for 
decreasing the amount of testing while improving its quality. Some 
of the practical problems have been resolved in the Pittsburgh IPI
Program so that the technique can now be used on a limited basis.
Nevertheless, many problems remain before adoption should or can
proceed within other programs. For example, it would be necessary to
develop a nonautomated modified version of branched testing for schools
without computers. Also, we need to know much more than we know
now about setting starting places, step sizes, stopping rules, etc.,
before we can effectively use branched testing in an instructional
setting.
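
The roles of the starting place, step size, and stopping rule can be seen in a small sketch. It is not the procedure used in any existing program. It assumes a strictly linear hierarchy in which mastery of an objective implies mastery of all lower objectives, and it assumes error-free mastery decisions; `is_mastered` is a hypothetical stand-in for administering and scoring a short test on one objective.

```python
def branched_test(is_mastered, n_objectives, start, step):
    """Locate the highest mastered objective in a linear hierarchy.

    Returns (index of highest mastered objective or -1, objectives tested).
    Starting place: `start`. Step size: `step`, used until the boundary
    between mastery and nonmastery is bracketed, after which the bracket
    is halved. Stopping rule: stop when the bracket cannot be narrowed.
    """
    lo, hi = -1, n_objectives  # lo: highest known mastered; hi: lowest known unmastered
    tested = []
    k = start
    while hi - lo > 1:
        k = max(lo + 1, min(hi - 1, k))  # keep the next test inside the bracket
        tested.append(k)
        if is_mastered(k):
            lo = k
        else:
            hi = k
        if lo >= 0 and hi < n_objectives:
            k = (lo + hi) // 2   # boundary bracketed: bisect
        elif lo >= 0:
            k = lo + step        # only successes so far: branch upward
        else:
            k = hi - step        # only failures so far: branch downward
    return lo, tested


# Hypothetical student who has mastered objectives 0 through 5 of 10.
result = branched_test(lambda k: k <= 5, n_objectives=10, start=3, step=3)
print(result)
```

In this example four objectives are tested instead of ten. With fallible tests of only a few items each, a single misclassification would send the student down the wrong branch, which is one reason the questions about starting places, step sizes, and stopping rules need study before the technique is adopted widely.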

Finally, there are many uses for criterion-referenced tests
besides the two studied in our monograph. And so it remains to pro- 
vide a similar review and integration of technical contributions 
for these uses. For example, the use of criterion-referenced tests 
in program evaluation will most likely involve methods of item selection
and test design different from those mentioned in this monograph.
It appears that the methods of matrix sampling could be employed 
very effectively for item selection in the context of program evaluation. 

It seems clear at this point in time that we have sufficient 
theory and practical guidelines to implement a highly efficient criterion- 
referenced testing program within the context of objectives-based 
programs. However, to date, no one has come close to implementing 
such a testing program. Among the questions that stand in the way
of the successful implementation of such a testing program are the 
following: What skills do classroom teachers need to have in order 
to implement a criterion-referenced testing program with all of the 
special refinements (e.g., Bayesian methods, tailored testing, etc.) and
how should we train them? Will it be possible to develop domain
specifications in content areas besides mathematics? Even in the area
of mathematics where most of the important work has been done (see, for
example, Hively et al., 1973) there have been questions raised about
the extent to which the notion of domain specifications and subsequent
test development can be extended to the more complex mathematics
objectives. Another question has to do with whether or not the details of
the Bayesian decision-theoretic procedure for allocating examinees to
mastery states can be put in a form that teachers will understand and 
be able to implement. For example, can we train teachers to specify 
their prior beliefs about abilities of examinees and losses associated 
with misclassification errors? Prior information for a Bayesian 
solution might include the student's past performance in the program, 
scores on other objectives included in the test, the overall performance 
of the group of students, etc. It is critical that such details be com- 
pletely checked out for their appropriateness and presented in a clear 
form to the teachers. 
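
What a teacher would actually have to supply can be illustrated with a small sketch. It assumes a beta-binomial model with a threshold loss function, and it is a simplified illustration rather than the procedure of any author cited here. The prior is written as `prior_a` hypothetical earlier successes and `prior_b` earlier failures (whole numbers, so that only standard-library arithmetic is needed), and `loss_fp` and `loss_fn` are the losses assigned to false-positive and false-negative errors.

```python
from math import comb


def prob_mastery(x, n, cutoff, prior_a=1, prior_b=1):
    """Posterior probability that the domain score is at least `cutoff`.

    With x correct out of n and a Beta(prior_a, prior_b) prior, the
    posterior is Beta(a, b) with a = prior_a + x and b = prior_b + n - x.
    For whole-number a and b its upper tail equals a binomial sum.
    """
    a = prior_a + x
    b = prior_b + (n - x)
    m = a + b - 1
    return sum(comb(m, j) * cutoff**j * (1 - cutoff)**(m - j) for j in range(a))


def allocate(x, n, cutoff, loss_fp, loss_fn, prior_a=1, prior_b=1):
    """Assign the mastery state with the smaller expected loss."""
    p = prob_mastery(x, n, cutoff, prior_a, prior_b)
    state = "master" if loss_fp * (1 - p) < loss_fn * p else "nonmaster"
    return state, p


# A perfect score on five items, uniform prior, cutting score 0.85.
print(allocate(5, 5, 0.85, loss_fp=1, loss_fn=1))
# Same data, but a false positive judged twice as costly.
print(allocate(5, 5, 0.85, loss_fp=2, loss_fn=1))
# Same losses, with a prior reflecting strong past performance.
print(allocate(5, 5, 0.85, loss_fp=2, loss_fn=1, prior_a=9, prior_b=1))
```

With a uniform prior, even a perfect score on five items gives a probability of mastery of only about 0.62, so the decision turns on the loss ratio and on the prior. These are precisely the quantities teachers would be asked to specify, which underlines the need to present them in a form teachers can understand.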